python3webspider / proxypool Goto Github PK

View Code? Open in Web Editor NEW

5.5K 120.0 2.0K 941 KB

An Efficient ProxyPool with Getter, Tester and Server

Home Page: https://proxypool.scrape.center

License: MIT License

Python 95.92% Dockerfile 1.21% Mustache 2.73% Shell 0.15%

proxypool redis http flask proxy webspider

proxypool's People

Stargazers

Watchers

Forkers

zsweet xianggithubli scalershare bonaba kof0012 aegeansea quincyc379 okakaino sanliang666 silverbooker keepgogoing huochequan kuaixuesoft anwzx isxuanxuan jinyun lvsoso xca117 middlexu criskt tim1999 fore-stack wukainf beriwan mayfool cowry5 mimizhang barktegh python0925 iamwyh yang-xinhui kwff pythonzm wangxu98 landihua auditore8 panziqiang007 toyourheart163 sportzhang fly2fire keyrenelu avispeng vanella farolding fon-khan sondeer oneisking lgravity yy189 rynder waitingfy feigong12 luyichengg cocktailpy anoshop appleandpearwow 631068264 jmd110 shidadao gold-cfx jealousljl junqiangle doraemon1 phantom3389 victsww wangchaooo yohee2015 hyfgreg xinihe hawu0616 jesszen baozhong1010 phychaos 86839858 relei1234 kainchow echolvan gonewithgt harryzj timelistener billxq lensv adamhk01 tec-yao mayi140611 luckygl linhuiyangcdns qithird wudangqibujie damon-zln zer0fire zhuyoucai168 ficuser jayden-z cg110778 michaelzyy leolianger pyyourdaye thee225 yangzhaoyunfei

proxypool's Issues

AttributeError: 'OutStream' object has no attribute 'buffer'

#启动代理池

from proxypool.scheduler import Scheduler
import sys
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def main():
try:
s = Scheduler()
s.run()
except:
main()

if name == 'main':
main()

AttributeError: 'OutStream' object has no attribute 'buffer'

项目部署到Linux直接报错了

代理池开始运行

Serving Flask app "proxypool.api" (lazy loading)
Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
Debug mode: off
Running on http://0.0.0.0:5555/ (Press CTRL+C to quit)
代理池开始运行
Serving Flask app "proxypool.api" (lazy loading)
Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
Debug mode: off
Process Process-3:
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 35, in schedule_api
app.run(API_HOST, API_PORT)
File "/usr/local/python3/lib/python3.6/site-packages/flask/app.py", line 990, in run
run_simple(host, port, self, **options)
File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 1009, in run_simple
inner()
File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 962, in inner
fd=fd,
File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 805, in make_server
host, port, app, request_handler, passthrough_errors, ssl_context, fd=fd
File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 698, in init
HTTPServer.init(self, server_address, handler)
File "/usr/local/python3/lib/python3.6/socketserver.py", line 453, in init
self.server_bind()
File "/usr/local/python3/lib/python3.6/http/server.py", line 136, in server_bind
socketserver.TCPServer.server_bind(self)
File "/usr/local/python3/lib/python3.6/socketserver.py", line 467, in server_bind
self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use
代理池开始运行
Serving Flask app "proxypool.api" (lazy loading)
Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
Debug mode: off
Process Process-3:
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 35, in schedule_api
app.run(API_HOST, API_PORT)
File "/usr/local/python3/lib/python3.6/site-packages/flask/app.py", line 990, in run
run_simple(host, port, self, **options)
File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 1009, in run_simple
inner()
File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 962, in inner
fd=fd,
File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 805, in make_server
host, port, app, request_handler, passthrough_errors, ssl_context, fd=fd
File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 698, in init
HTTPServer.init(self, server_address, handler)
File "/usr/local/python3/lib/python3.6/socketserver.py", line 453, in init
self.server_bind()
File "/usr/local/python3/lib/python3.6/http/server.py", line 136, in server_bind
socketserver.TCPServer.server_bind(self)
File "/usr/local/python3/lib/python3.6/socketserver.py", line 467, in server_bind
self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use
开始抓取代理
获取器开始执行
Process Process-2:
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 526, in connect
sock = self._connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 583, in _connect
raise err
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 571, in _connect
sock.connect(socket_address)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 28, in schedule_getter
getter.run()
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 23, in run
if not self.is_over_threshold():
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 16, in is_over_threshold
if self.redis.count() >= POOL_UPPER_THRESHOLD:
File "/usr/local2/app/ProxyPool-master/proxypool/db.py", line 84, in count
return self.db.zcard(REDIS_KEY)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 2395, in zcard
return self.execute_command('ZCARD', name)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 836, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 1059, in get_connection
connection.connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 531, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 110 connecting to 120.79.34.216:6379. Connection timed out.
开始抓取代理
获取器开始执行
Process Process-2:
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 526, in connect
sock = self._connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 583, in _connect
raise err
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 571, in _connect
sock.connect(socket_address)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

代理获取进程好像死亡了，这是怎么回事

运行过程中代理抓取进程好像死亡了，不知道是什么问题？
观察到测试进程和API进程一直在运行，代理抓取进程没有动，redis队列中代理也一直在减少，有人知道这是什么问题吗？

对爬取ip的代码进行了优化, 正则全部替换成pyquery来提取

`import json
import re
from .utils import get_page
from pyquery import PyQuery as pq

class ProxyMetaclass(type):
def new(cls, name, bases, attrs):
count = 0
attrs['CrawlFunc'] = []
for k, v in attrs.items():
if 'crawl_' in k:
attrs['CrawlFunc'].append(k)
count += 1
attrs['CrawlFuncCount'] = count
return type.new(cls, name, bases, attrs)

class Crawler(object, metaclass=ProxyMetaclass):
def get_proxies(self, callback):
proxies = []
for proxy in eval(f"self.{callback}()"):
print('成功获取到代理', proxy)
proxies.append(proxy)
return proxies

def crawl_daili66(self, page_count=4):
    """
    获取代理66
    :param page_count: 页码
    :return: 代理
    """
    start_url = 'http://www.66ip.cn/{}.html'
    urls = [start_url.format(page) for page in range(1, page_count + 1)]
    for url in urls:
        print('Crawling', url)
        html = get_page(url)
        if html:
            doc = pq(html)
            trs = doc('.containerbox table tr:gt(0)').items()  # index > 0  第0个tr节点里面没有ip和port
            for tr in trs:
                ip = tr.find('td:nth-child(1)').text()
                port = tr.find('td:nth-child(2)').text()
                yield ':'.join([ip.strip(), port.strip()])


def crawl_ip3366(self):
    for i in range(1, 4):
        start_url = 'http://www.ip3366.net/?stype=1&page={}'.format(i)
        html = get_page(start_url)
        if html:
            doc = pq(html)
            trs = doc('#container #list table tbody tr').items()
            for tr in trs:
                ip = tr.find('td:nth-child(1)').text()
                port = tr.find('td:nth-child(2)').text()
                yield ':'.join([ip.strip(), port.strip()])

def crawl_kuaidaili(self):
    for i in range(1, 4):
        start_url = 'http://www.kuaidaili.com/free/inha/{}/'.format(i)
        html = get_page(start_url)
        if html:
            doc = pq(html)
            trs = doc('#content .con-body #list table tbody tr').items()
            for tr in trs:
                ip = tr.find('td:nth-child(1)').text()
                port = tr.find('td:nth-child(2)').text()
                yield ':'.join([ip.strip(), port.strip()])

def crawl_iphai(self):
    start_url = 'http://www.iphai.com/'
    html = get_page(start_url)
    # print(html)
    if html:
        doc = pq(html)
        trs = doc('.container .table tr:gt(0)').items()
        for tr in trs:
            ip = tr.find('td:nth-child(1)').text()
            port = tr.find('td:nth-child(2)').text()
            yield ':'.join([ip.strip(), port.strip()])

def crawl_xicidaili(self):
    for i in range(1, 3):
        start_url = 'http://www.xicidaili.com/nn/{}'.format(i)
        html = get_page(start_url)
        if html:
            doc = pq(html)
            trs = doc('#wrapper #body table tr:gt(0)').items()
            for tr in trs:
                ip = tr.find('td:nth-child(2)').text()
                port = tr.find('td:nth-child(3)').text()
                yield ':'.join([ip.strip(), port.strip()])


def crawl_data5u(self):
    start_url = 'http://www.data5u.com/'
    html = get_page(start_url)
    if html:
        doc = pq(html)
        uls = doc('.wlist>ul ul:gt(0)').items()
        for ul in uls:
            ip = ul.find('span:nth-child(1)').text()
            port = ul.find('span:nth-child(2)').text()
            yield ':'.join([ip.strip(), port.strip()])

        `

自己替换一下就行了, 亲测没问题, 当前时间2019-10-10

增加对pipenv包管理器的支持

增加对pipenv包管理器的支持, 本机已经测试通过.
redis: v=4.0.9

获取器过一段时间就宕掉了？

运行一段时间，自动就宕掉了，是什么情况，可以解决吗？

报了这个错误：requests.exceptions.ConnectionError：10054

requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, '远程主机强迫关闭了一个现有的连接。', None, 10054, None))

本地网页的代理是不是已经测试过可以使用的？

能否获取能够破网的代理

请问 AttributeError: type object 'URL' has no attribute 'build' 这个怎么解决

File "run.py", line 1, in
from proxypool.scheduler import Scheduler
File "C:\ProxyPool-master\proxypool\scheduler.py", line 4, in
from proxypool.getter import Getter
File "C:\ProxyPool-master\proxypool\getter.py", line 1, in
from proxypool.tester import Tester
File "C:\ProxyPool-master\proxypool\tester.py", line 2, in
import aiohttp
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp_init_.py", line 6, in
from .client import (
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\client.py", line 32, in
from . import hdrs, http, payload
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\http.py", line 7, in
from .http_parser import (
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\http_parser.py", line 755, in
from ._http_parser import (HttpRequestParser, # type: ignore # noqa

File "aiohttp_http_parser.pyx", line 44, in init aiohttp._http_parser
AttributeError: type object 'URL' has no attribute 'build'

我访问localhost:5555/random时不能换代理多次刷新只有一个最初的代理地址，请问一下是什么问题？

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

Go to '...'
Click on '....'
Scroll down to '....'
See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Environments (please complete the following information):

OS: [e.g. macOS 10.15.2]
Python [e.g. Python 3.6]
Browser [e.g. Chrome 67 ]

Additional context
Add any other context about the problem here.

zincrby 新的版本redis-py有改动

zincrby(name, amount, value)

需要将源代码中的zincrby第二、三参数换个顺序

value is not a valid float

redis.exceptions.ResponseError: value is not a valid float

关于redis-py版本问题

redis-py 3.X版和2.X版 zadd和zincrby有变化
3.X版中的zadd需要传入一个dict，（element-names -> score）
zincrby参数中amount和value互换

代码中url错误导致的retrying.RetryError: RetryError[Attempts: 3, Value: None]

url错误导致的retrying模块报错问题

proxypool.crawlers.daili66.BASE_URL:http://www.664ip.cn/{page}.html ,这个url的域名应该是写错了，改成 www.66ip.cn 就可以正常运行

如何在pycharm里调试该项目

我使用一个远程的环境，想在pycharm里调试该项目，但是每次Debug run.py 都显示文件无法找到，请问如何使用pycharm调试这个项目

what/

D:\Pycharm工作资料\代码流\venv\Scripts\python.exe C:/Users/ThinkPad/Downloads/ProxyPool-master/run.py
浠ｇ悊姹犲紑濮嬭繍琛�

Serving Flask app "proxypool.api" (lazy loading)
Environment: production
WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.
Debug mode: off
Running on http://0.0.0.0:5555/ (Press CTRL+C to quit)
Process Process-2:
寮�濮嬫姄鍙栦唬鐞�
鑾峰彇鍣ㄥ紑濮嬫墽琛�
Crawling http://www.66ip.cn/1.html
姝ｅ湪鎶撳彇 http://www.66ip.cn/1.html
鎶撳彇鎴愬姛 http://www.66ip.cn/1.html 200
鎴愬姛鑾峰彇鍒颁唬鐞� 177.185.148.46:58623
鎴愬姛鑾峰彇鍒颁唬鐞� 131.196.143.11:7
鎴愬姛鑾峰彇鍒颁唬鐞� 131.196.143.117:33729
鎴愬姛鑾峰彇鍒颁唬鐞� 43.243.141.126:53281
鎴愬姛鑾峰彇鍒颁唬鐞� 111.181.35.219:9999
Crawling http://www.66ip.cn/2.html
姝ｅ湪鎶撳彇 http://www.66ip.cn/2.html
鎶撳彇鎴愬姛 http://www.66ip.cn/2.html 200
鎴愬姛鑾峰彇鍒颁唬鐞� 170.0.112.226:50359
鎴愬姛鑾峰彇鍒颁唬鐞� 54.39.144.247:8080
鎴愬姛鑾峰彇鍒颁唬鐞� 171.41.82.36:9999
鎴愬姛鑾峰彇鍒颁唬鐞� 37.32.126.0:8080
鎴愬姛鑾峰彇鍒颁唬鐞� 213.33.224.82:8080
鎴愬姛鑾峰彇鍒颁唬鐞� 144.123.71.133:9999
鎴愬姛鑾峰彇鍒颁唬鐞� 111.177.166.59:9999
鎴愬姛鑾峰彇鍒颁唬鐞� 117.196.237.40:59250
鎴愬姛鑾峰彇鍒颁唬鐞� 121.61.3.110:9999
鎴愬姛鑾峰彇鍒颁唬鐞� 212.200.126.14:8080
鎴愬姛鑾峰彇鍒颁唬鐞� 47.107.245.9:4
鎴愬姛鑾峰彇鍒颁唬鐞� 47.107.245.94:3128
Crawling http://www.66ip.cn/3.html
姝ｅ湪鎶撳彇 http://www.66ip.cn/3.html
鎶撳彇鎴愬姛 http://www.66ip.cn/3.html 200
鎴愬姛鑾峰彇鍒颁唬鐞� 111.177.162.175:9999
鎴愬姛鑾峰彇鍒颁唬鐞� 110.52.235.60:9999
鎴愬姛鑾峰彇鍒颁唬鐞� 37.224.19.1:0
鎴愬姛鑾峰彇鍒颁唬鐞� 175.100.185.151:53281
鎴愬姛鑾峰彇鍒颁唬鐞� 37.224.19.10:6
鎴愬姛鑾峰彇鍒颁唬鐞� 179.127.249.5:3
鎴愬姛鑾峰彇鍒颁唬鐞� 37.224.19.106:58553
鎴愬姛鑾峰彇鍒颁唬鐞� 179.127.249.53:46257
鎴愬姛鑾峰彇鍒颁唬鐞� 111.177.183.4:5
鎴愬姛鑾峰彇鍒颁唬鐞� 1.20.101.221:55707
鎴愬姛鑾峰彇鍒颁唬鐞� 111.177.183.45:9999
鎴愬姛鑾峰彇鍒颁唬鐞� 91.219.171.8:4
Crawling http://www.66ip.cn/4.html
姝ｅ湪鎶撳彇 http://www.66ip.cn/4.html
鎶撳彇鎴愬姛 http://www.66ip.cn/4.html 200
鎴愬姛鑾峰彇鍒颁唬鐞� 91.219.171.84:43726
鎴愬姛鑾峰彇鍒颁唬鐞� 212.26.247.178:38418
鎴愬姛鑾峰彇鍒颁唬鐞� 203.42.227.1:1
鎴愬姛鑾峰彇鍒颁唬鐞� 203.42.227.11:3
鎴愬姛鑾峰彇鍒颁唬鐞� 203.42.227.113:8080
鎴愬姛鑾峰彇鍒颁唬鐞� 110.52.235.126:9999
鎴愬姛鑾峰彇鍒颁唬鐞� 170.239.224.58:8080
鎴愬姛鑾峰彇鍒颁唬鐞� 190.119.199.18:57333
鎴愬姛鑾峰彇鍒颁唬鐞� 5.0.0.815:0
鎴愬姛鑾峰彇鍒颁唬鐞� 190.152.182.150:53281
鎴愬姛鑾峰彇鍒颁唬鐞� 119.40.98.84:46119
鎴愬姛鑾峰彇鍒颁唬鐞� 111.177.170.220:9999
Traceback (most recent call last):
File "D:\python\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "D:\python\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\ThinkPad\Downloads\ProxyPool-master\proxypool\scheduler.py", line 28, in schedule_getter
getter.run()
File "C:\Users\ThinkPad\Downloads\ProxyPool-master\proxypool\getter.py", line 30, in run
self.redis.add(proxy)
File "C:\Users\ThinkPad\Downloads\ProxyPool-master\proxypool\db.py", line 30, in add
return self.db.zadd(REDIS_KEY, score, proxy)
File "D:\Pycharm工作资料\代码流\venv\lib\site-packages\redis\client.py", line 2320, in zadd
for pair in iteritems(mapping):
File "D:\Pycharm工作资料\代码流\venv\lib\site-packages\redis_compat.py", line 122, in iteritems
return iter(x.items())
AttributeError: 'int' object has no attribute 'items'

Windows下运行正常，macOS和Linux下均报错，网上查了半天，依然一头雾水，求大神解惑。

代理池开始运行

Serving Flask app "proxypool.api" (lazy loading)
Environment: production
WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.
Debug mode: off
Running on http://0.0.0.0:7777/ (Press CTRL+C to quit)
开始抓取代理
获取器开始执行
Crawling http://www.66ip.cn/1.html
正在抓取 http://www.66ip.cn/1.html
抓取成功 http://www.66ip.cn/1.html 521
Crawling http://www.66ip.cn/2.html
正在抓取 http://www.66ip.cn/2.html
抓取成功 http://www.66ip.cn/2.html 521
Crawling http://www.66ip.cn/3.html
正在抓取 http://www.66ip.cn/3.html
抓取成功 http://www.66ip.cn/3.html 521
Crawling http://www.66ip.cn/4.html
正在抓取 http://www.66ip.cn/4.html
抓取成功 http://www.66ip.cn/4.html 521
Crawling http://www.proxy360.cn/Region/China
正在抓取 http://www.proxy360.cn/Region/China
抓取成功 http://www.proxy360.cn/Region/China 400
正在抓取 http://www.goubanjia.com/free/gngn/index.shtml
抓取成功 http://www.goubanjia.com/free/gngn/index.shtml 404
正在抓取 http://www.ip3366.net/?stype=1&page=1
抓取成功 http://www.ip3366.net/?stype=1&page=1 200
成功获取到代理 112.87.254.81:8118
成功获取到代理 103.115.180.96:42556
成功获取到代理 103.218.25.52:53281
成功获取到代理 80.211.55.179:3128
成功获取到代理 137.59.162.178:52497
成功获取到代理 165.90.209.141:31975
成功获取到代理 80.211.84.179:3128
成功获取到代理 103.108.96.159:46258
成功获取到代理 103.106.101.12:45100
成功获取到代理 112.84.85.164:8118
正在抓取 http://www.ip3366.net/?stype=1&page=2
抓取成功 http://www.ip3366.net/?stype=1&page=2 200
成功获取到代理 183.172.131.4:8118
成功获取到代理 112.67.35.134:8118
成功获取到代理 59.110.48.236:3128
成功获取到代理 111.224.137.25:80
成功获取到代理 138.121.31.108:53281
成功获取到代理 103.225.228.101:58732
成功获取到代理 222.181.10.102:8118
成功获取到代理 111.224.34.224:80
成功获取到代理 103.81.15.113:57803
成功获取到代理 101.27.22.144:61234
正在抓取 http://www.ip3366.net/?stype=1&page=3
抓取成功 http://www.ip3366.net/?stype=1&page=3 200
成功获取到代理 119.179.133.233:8060
成功获取到代理 119.179.143.43:8060
成功获取到代理 119.179.143.43:8060
成功获取到代理 106.58.248.101:80
成功获取到代理 119.254.94.71:52811
成功获取到代理 119.179.130.179:8060
成功获取到代理 27.208.85.141:8060
成功获取到代理 123.207.233.182:808
成功获取到代理 112.66.70.180:8060
成功获取到代理 170.82.21.168:53281
Process Process-2:
Traceback (most recent call last):
File "/usr/local/var/pyenv/versions/3.7.1/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/usr/local/var/pyenv/versions/3.7.1/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/Users/hao/Documents/Coding/ProxyPool/proxypool/scheduler.py", line 28, in schedule_getter
getter.run()
File "/Users/hao/Documents/Coding/ProxyPool/proxypool/getter.py", line 30, in run
self.redis.add(proxy)
File "/Users/hao/Documents/Coding/ProxyPool/proxypool/db.py", line 30, in add
return self.db.zadd(REDIS_KEY, score, proxy)
File "/usr/local/var/pyenv/versions/3.7.1/lib/python3.7/site-packages/redis/client.py", line 2263, in zadd
for pair in iteritems(mapping):
File "/usr/local/var/pyenv/versions/3.7.1/lib/python3.7/site-packages/redis/_compat.py", line 123, in iteritems
return iter(x.items())
AttributeError: 'int' object has no attribute 'items'

Macbook上碰到https的错误

➜  ProxyPool git:(master) pip3 install -r requirements.txt
pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Collecting aiohttp>=1.3.3 (from -r requirements.txt (line 1))
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Could not fetch URL https://pypi.org/simple/aiohttp/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/aiohttp/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping
  Could not find a version that satisfies the requirement aiohttp>=1.3.3 (from -r requirements.txt (line 1)) (from versions: )
No matching distribution found for aiohttp>=1.3.3 (from -r requirements.txt (line 1))
pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping
➜  ProxyPool git:(master)

大佬们，为什么我运行run.py之后就没反应了

程序也不报错，也不停止，但就是什么也不显示，那些print的语句我写了的。

无法定时爬取新的代理

请问为何只是在刚开始的时候爬取了一次代理之后会就不会定时爬取了？

运行run后卡在“获取器开始执行”

相关配置和安装等都搞定了，之前那个用pop实现的也能用，但这个我运行run后却卡在“获取器开始执行”，请问怎么解决？谢谢了。

aiohttp 不支持https代理，有办法可以批量测试https代理吗

在 VS Code 中运行异常 ~ 而且无法关闭

当在命令行中运行 python run.py 时，项目可以正常工作，可以通过 http://127.0.0.1:5555/random 获取到代理；但当使用 VS Code 直接运行（按 F5）时，会出现异常，而且关不掉，几次都是自己通过重启才关掉。

有大佬能解释一下么？

requirements.txt中的 redis版本有限制

requirements.txt中redis版本为redis>=2.10.5
默认会安装最新版，现在已经3.x了。
试验证明，操作zadd时会报错。

是否存在管理redis connectionpool的问题？

https://stackoverflow.com/questions/31663288/how-do-i-properly-use-connection-pools-in-redis
我在想每次请求链接redis都创建一个链接，不如写成
`redis_pool = None

class RedisClient(object):
def init(self, host=HOST, port=PORT):
global redis_pool

    if not redis_pool:
        if PASSWORD:
            redis_pool = redis.Redis(host=host, port=port, password=PASSWORD)
        else:
            redis_pool = redis.Redis(host=host, port=port)
        self._db = redis_pool
    else:
        self._db = redis_pool`

嗯把大神的直接拿过来运行，一开始还能运行，爬着爬着报错了，好难啊，感觉好复杂

代理池开始运行

Serving Flask app "proxypool.api" (lazy loading)
Environment: production
WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.
Debug mode: off
Running on http://0.0.0.0:5555/ (Press CTRL+C to quit)
开始抓取代理
获取器开始执行
Crawling http://www.66ip.cn/1.html
正在抓取 http://www.66ip.cn/1.html
抓取成功 http://www.66ip.cn/1.html 521
Crawling http://www.66ip.cn/2.html
正在抓取 http://www.66ip.cn/2.html
抓取成功 http://www.66ip.cn/2.html 521
Crawling http://www.66ip.cn/3.html
正在抓取 http://www.66ip.cn/3.html
抓取成功 http://www.66ip.cn/3.html 521
Crawling http://www.66ip.cn/4.html
正在抓取 http://www.66ip.cn/4.html
抓取成功 http://www.66ip.cn/4.html 521
正在抓取 http://www.ip3366.net/?stype=1&page=1
抓取成功 http://www.ip3366.net/?stype=1&page=1 200
成功获取到代理 222.135.25.243:8060
成功获取到代理 180.175.8.5:8060
成功获取到代理 119.180.131.25:8060
成功获取到代理 180.175.160.130:8060
成功获取到代理 119.180.177.138:8060
成功获取到代理 119.180.1.42:8060
成功获取到代理 171.112.165.22:9999
成功获取到代理 222.182.121.71:8118
成功获取到代理 118.81.68.2:80
成功获取到代理 117.166.3.51:8118
正在抓取 http://www.ip3366.net/?stype=1&page=2
抓取成功 http://www.ip3366.net/?stype=1&page=2 200
成功获取到代理 171.83.164.51:9999
成功获取到代理 47.101.189.13:80
成功获取到代理 171.112.164.149:9999
成功获取到代理 171.112.164.109:9999
成功获取到代理 119.97.237.74:80
成功获取到代理 197.234.42.73:8083
成功获取到代理 103.120.152.182:59068
成功获取到代理 117.168.86.102:8118
成功获取到代理 115.215.212.116:8118
成功获取到代理 103.244.91.61:8080
正在抓取 http://www.ip3366.net/?stype=1&page=3
抓取成功 http://www.ip3366.net/?stype=1&page=3 200
成功获取到代理 117.80.137.238:9999
成功获取到代理 103.233.145.133:8080
成功获取到代理 117.80.17.81:8118
成功获取到代理 171.83.165.10:9999
成功获取到代理 43.248.123.237:8080
成功获取到代理 113.227.182.15:8118
成功获取到代理 138.97.219.51:65301
成功获取到代理 117.41.142.159:8118
成功获取到代理 197.234.42.209:8083
成功获取到代理 197.234.44.125:8083
Process Process-2:
Traceback (most recent call last):
File "D:\Python\lib\multiprocessing\process.py", line 297, in _bootstrap
self.run()
File "D:\Python\lib\multiprocessing\process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "D:\spider-test\ProxyPool-master\proxypool\scheduler.py", line 28, in schedule_getter
getter.run()
File "D:\spider-test\ProxyPool-master\proxypool\getter.py", line 30, in run
self.redis.add(proxy)
File "D:\spider-test\ProxyPool-master\proxypool\db.py", line 30, in add
return self.db.zadd(REDIS_KEY, score, proxy)
File "D:\Python\lib\site-packages\redis\client.py", line 2320, in zadd
for pair in iteritems(mapping):
File "D:\Python\lib\site-packages\redis_compat.py", line 109, in iteritems
return iter(x.items())
AttributeError: 'int' object has no attribute 'items'

redis的有序集合zadd方法变更

def add(self, proxy, score=INITIAL_SCORE):
    """
    添加代理，设置分数为最高
    :param proxy: 代理
    :param score: 分数
    :return: 添加结果
    """
    if not re.match('\d+\.\d+\.\d+\.\d+\:\d+', proxy):
        print('代理不符合规范', proxy, '丢弃')
        return
    if not self.db.zscore(REDIS_KEY, proxy):
        dic={}
        dic[proxy] =score

        return self.db.zadd(REDIS_KEY, dic)

大神：能不能做个docker，环境变化太多了

如题

爬虫函数可以考虑增加一个直接读文件

粗略写了个，放在Crawler类里面，每行用“地址：端口”格式就行。
def crawl_file(self):
filename = 'proxy.txt' # txt文件和当前脚本在同一目录下，所以不用写具体路径
with open(filename, 'r') as file_to_read:
while True:
lines = file_to_read.readline() # 整行读取数据
if not lines:
break
yield lines

书上的和这个上面的都试过了就是不行

各个库以及Python的版本都是符合要求的
然后python run.py 的时候就报错
ImportError: cannot import name 'etree' from 'lxml'
实在是百度不到方法了，求解。

error in db.py: fixed the problem by change line 30 with:

return self.db.zadd(REDIS_KEY, {proxy: score})

写入时报错

运行报错

/proxypool/db.py", line 30, in add

return iter(x.items())
AttributeError: 'int' object has no attribute 'items'

api没有写get

不知道作者是忘了写还是，应该改成random，不然获取不到

http类型的ip测试结果不可用

如果使用下面的测试方式是没有问题的，另外一个问题是aiohttp不支持https的代理
response=requests.get('https://www.baidu.com',proxies=‘HTTP://125.123.139.131:9999’,timeout=3)

请教这个问题怎么解决？

Ip processing running

Serving Flask app "proxypool.api" (lazy loading)
Environment: production
WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.
Debug mode: off
Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
Refreshing ip
PoolAdder is working
Waiting for adding
Callback crawl_ip181
Error occurred during loading data. Trying to use cache server http://d2g6u4gh6d9rq0.cloudfront.net/browsers/fake_useragent_0.1.10.json
Traceback (most recent call last):
File "C:\Python\Python36\lib\urllib\request.py", line 1318, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "C:\Python\Python36\lib\http\client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "C:\Python\Python36\lib\http\client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "C:\Python\Python36\lib\http\client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "C:\Python\Python36\lib\http\client.py", line 1026, in _send_output
self.send(msg)
File "C:\Python\Python36\lib\http\client.py", line 964, in send
self.connect()
File "C:\Python\Python36\lib\http\client.py", line 1392, in connect
super().connect()
File "C:\Python\Python36\lib\http\client.py", line 936, in connect
(self.host,self.port), self.timeout, self.source_address)
File "C:\Python\Python36\lib\socket.py", line 722, in create_connection
raise err
File "C:\Python\Python36\lib\socket.py", line 713, in create_connection
sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 67, in get
context=context,
File "C:\Python\Python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Python\Python36\lib\urllib\request.py", line 526, in open
response = self._open(req, data)
File "C:\Python\Python36\lib\urllib\request.py", line 544, in _open
'_open', req)
File "C:\Python\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Python\Python36\lib\urllib\request.py", line 1361, in https_open
context=self._context, check_hostname=self._check_hostname)
File "C:\Python\Python36\lib\urllib\request.py", line 1320, in do_open
raise URLError(err)
urllib.error.URLError:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 154, in load
for item in get_browsers(verify_ssl=verify_ssl):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 97, in get_browsers
html = get(settings.BROWSERS_STATS_PAGE, verify_ssl=verify_ssl)
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 84, in get
raise FakeUserAgentError('Maximum amount of retries reached')
fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached
Process Process-2:
Traceback (most recent call last):
File "C:\Python\Python36\lib\multiprocessing\process.py", line 249, in _bootstrap
self.run()
File "C:\Python\Python36\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "C:\迅雷下载\ProxyPool-master\proxypool\schedule.py", line 130, in check_pool
adder.add_to_queue()
File "C:\迅雷下载\ProxyPool-master\proxypool\schedule.py", line 87, in add_to_queue
raw_proxies = self._crawler.get_raw_proxies(callback)
File "C:\迅雷下载\ProxyPool-master\proxypool\getter.py", line 28, in get_raw_proxies
for proxy in eval("self.{}()".format(callback)):
File "C:\迅雷下载\ProxyPool-master\proxypool\getter.py", line 35, in crawl_ip181
html = get_page(start_url)
File "C:\迅雷下载\ProxyPool-master\proxypool\utils.py", line 14, in get_page
'User-Agent': ua.random,
UnboundLocalError: local variable 'ua' referenced before assignment
Refreshing ip
Waiting for adding
Refreshing ip
Waiting for adding
Refreshing ip

setup.py 里面的console_script

我发现console_script里面写run:cli完全没有办法安装之后成功使用。请问为什么可以写成这个样子呢？我把run名称改为pool_run，脚本改成 pool_run:main 就可以正常使用了。

按照readme获取的端口不对

应该是localhost:5555/random吧

配置文件中的redis缓存配置暴露了

hi 你好我在用你的代码的时候发现你的redis配置暴露了.....

关于trace back中db.py的问题

看了之前的issues都说要把db.py中的add和max两个函数里的zadd方法的参数修改成字典形式（zadd(REDIS_KEY, {proxy：score})），以及decrease函数里的zincrby方法参数对调（zincrby(REDIS_KEY, -1， proxy)）但是我修改之后反而会出问题。
我去看了Redis-py的文档，不知道是不是有什么改版，发现作者原本写的才符合文档里的要求。
文档说明如下：

所以这两个地方其实不用改了。
不过有个地方似乎是错了，

，框住的第一个应该是MIN_SCORE吧？

crawl_ip3366（）函数是不是有一个多余了，源代码中并没有删除或注释

Crawl.py模块里是不是有个函数写多余了，crawl_ip3366写了两个。

新版redis中zadd改动

新的版本中zadd有改动，需要改成zadd(REDIS_KEY, {proxy: score})
一共两处，分别在RedisClient.add()和RedisClient.max()里

程序执行不成功

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

Go to '...'
Click on '....'
Scroll down to '....'
See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Environments (please complete the following information):

OS: [win10]
Python [Python 3.7]

Additional context
Add any other context about the problem here.

代码看起来太费劲了，必须完全按照你书上来，才走的通

代码看起来太费劲了，必须完全按照你书上来，才走的通，我又不想安装redis

运行后一直是代理请求失败

前面的都能正常运行，到了测试的时候就是代理请求失败，想寻求解决方法

代理池项目中setting.py文件相关配置

不算bug，建议：
1.在项目setting.py文件中，看到声明了LOG_DIR日志存储路径参数，但未使用。
应新建出...\project\ProxyPool\logs文件夹，并在配置文件中修改：
logger.add(env.str('LOG_RUNTIME_FILE', 'runtime.log'), level='DEBUG', rotation='1 week', retention='20 days')
logger.add(env.str('LOG_ERROR_FILE', 'error.log'), level='ERROR', rotation='1 week')

修改为：
logger.add(env.str('LOG_RUNTIME_FILE', f'{LOG_DIR}/runtime.log'), level='DEBUG', rotation='1 week', retention='20 days')
logger.add(env.str('LOG_ERROR_FILE', f'{LOG_DIR}/error.log'), level='ERROR', rotation='1 week')

2.setting.py文件中ENABLE_TESTER， ENABLE_GETTER， ENABLE_SERVER开关参数若都为False时，运行run.py文件报错（try方法中finally还会报错），可修改scheduler.py文件。（此条有点杠精，可忽略）

RequestsDependencyWarning: urllib3 (1.23) or chardet (2.3.0) doesn't match a supported version!

怎么解决

优化付费代理按需使用的需求

免费资源可用率不高，希望是一个付费ip和免费ip结合的代理池。
这时候就有一个问题：无限测试付费ip，只扣费了，但是实际业务没有在用代理。

希望优化付费代理按需使用机制：
付费代理只有在有爬虫需求的时候，启动拉取，并且定制从代理服务商拉取IP个数。

弹出错误，请求帮助attributes() got an unexpected keyword argument 'frozen'

Traceback (most recent call last):
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\run.py", line 1, in
from proxypool.scheduler import Scheduler
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\scheduler.py", line 4, in
from proxypool.getter import Getter
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\getter.py", line 1, in
from proxypool.tester import Tester
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\tester.py", line 2, in
import aiohttp
File "D:\Anaconda3\lib\site-packages\aiohttp_init_.py", line 6, in
from .client import * # noqa
File "D:\Anaconda3\lib\site-packages\aiohttp\client.py", line 16, in
from . import client_exceptions, client_reqrep
File "D:\Anaconda3\lib\site-packages\aiohttp\client_reqrep.py", line 18, in
from . import hdrs, helpers, http, multipart, payload
File "D:\Anaconda3\lib\site-packages\aiohttp\helpers.py", line 161, in
@attr.s(frozen=True, slots=True)
TypeError: attributes() got an unexpected keyword argument 'frozen'

python3webspider / proxypool Goto Github PK

proxypool's People

Stargazers

Watchers

Forkers

proxypool's Issues

Recommend Projects

Recommend Topics

Recommend Org