
proxypool's Issues

AttributeError: 'OutStream' object has no attribute 'buffer'

# Start the proxy pool

from proxypool.scheduler import Scheduler
import sys
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def main():
    try:
        s = Scheduler()
        s.run()
    except:
        main()

if __name__ == '__main__':
    main()

AttributeError: 'OutStream' object has no attribute 'buffer'

Deploying the project to Linux fails immediately with this error.
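
This AttributeError typically means sys.stdout has already been replaced by an IDE or notebook shell (IPython's OutStream, for instance) with an object that has no .buffer attribute. A minimal guard, assuming the UTF-8 wrapping is only wanted when a real binary buffer exists:

    import io
    import sys

    # Only wrap stdout when it exposes a binary buffer; IPython/PyCharm
    # console streams do not, and wrapping them raises this AttributeError.
    if hasattr(sys.stdout, 'buffer'):
        sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')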

Proxy pool started running

  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: This is a development server. Do not use it in a production deployment.
    Use a production WSGI server instead.
  • Debug mode: off
  • Running on http://0.0.0.0:5555/ (Press CTRL+C to quit)
    Proxy pool started running
  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: This is a development server. Do not use it in a production deployment.
    Use a production WSGI server instead.
  • Debug mode: off
    Process Process-3:
    Traceback (most recent call last):
    File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
    File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 35, in schedule_api
    app.run(API_HOST, API_PORT)
    File "/usr/local/python3/lib/python3.6/site-packages/flask/app.py", line 990, in run
    run_simple(host, port, self, **options)
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 1009, in run_simple
    inner()
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 962, in inner
    fd=fd,
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 805, in make_server
    host, port, app, request_handler, passthrough_errors, ssl_context, fd=fd
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 698, in init
    HTTPServer.init(self, server_address, handler)
    File "/usr/local/python3/lib/python3.6/socketserver.py", line 453, in init
    self.server_bind()
    File "/usr/local/python3/lib/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
    File "/usr/local/python3/lib/python3.6/socketserver.py", line 467, in server_bind
    self.socket.bind(self.server_address)
    OSError: [Errno 98] Address already in use
    Proxy pool started running
  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: This is a development server. Do not use it in a production deployment.
    Use a production WSGI server instead.
  • Debug mode: off
    Process Process-3:
    Traceback (most recent call last):
    File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
    File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 35, in schedule_api
    app.run(API_HOST, API_PORT)
    File "/usr/local/python3/lib/python3.6/site-packages/flask/app.py", line 990, in run
    run_simple(host, port, self, **options)
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 1009, in run_simple
    inner()
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 962, in inner
    fd=fd,
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 805, in make_server
    host, port, app, request_handler, passthrough_errors, ssl_context, fd=fd
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 698, in init
    HTTPServer.init(self, server_address, handler)
    File "/usr/local/python3/lib/python3.6/socketserver.py", line 453, in init
    self.server_bind()
    File "/usr/local/python3/lib/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
    File "/usr/local/python3/lib/python3.6/socketserver.py", line 467, in server_bind
    self.socket.bind(self.server_address)
    OSError: [Errno 98] Address already in use
    Starting to crawl proxies
    Getter started running
    Process Process-2:
    Traceback (most recent call last):
    File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 526, in connect
    sock = self._connect()
    File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 583, in _connect
    raise err
    File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 571, in _connect
    sock.connect(socket_address)
    TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 28, in schedule_getter
getter.run()
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 23, in run
if not self.is_over_threshold():
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 16, in is_over_threshold
if self.redis.count() >= POOL_UPPER_THRESHOLD:
File "/usr/local2/app/ProxyPool-master/proxypool/db.py", line 84, in count
return self.db.zcard(REDIS_KEY)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 2395, in zcard
return self.execute_command('ZCARD', name)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 836, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 1059, in get_connection
connection.connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 531, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 110 connecting to 120.79.34.216:6379. Connection timed out.
Starting to crawl proxies
Getter started running
Process Process-2:
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 526, in connect
sock = self._connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 583, in _connect
raise err
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 571, in _connect
sock.connect(socket_address)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 28, in schedule_getter
getter.run()
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 23, in run
if not self.is_over_threshold():
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 16, in is_over_threshold
if self.redis.count() >= POOL_UPPER_THRESHOLD:
File "/usr/local2/app/ProxyPool-master/proxypool/db.py", line 84, in count
return self.db.zcard(REDIS_KEY)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 2395, in zcard
return self.execute_command('ZCARD', name)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 836, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 1059, in get_connection
connection.connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 531, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 110 connecting to 120.79.34.216:6379. Connection timed out.
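
Two separate failures show up above: port 5555 is already bound by an earlier instance (the OSError), and Redis at 120.79.34.216:6379 is unreachable, which usually means the server is not listening on that interface or a firewall/security group is dropping the connection. A quick hedged connectivity check, assuming redis-py and the host/port from setting.py:

    import redis

    # Fail fast instead of hanging for the default TCP timeout; fill in
    # the password if your setting.py defines one.
    client = redis.StrictRedis(host='120.79.34.216', port=6379,
                               socket_connect_timeout=5)
    try:
        client.ping()
        print('Redis reachable')
    except redis.exceptions.ConnectionError as e:
        print('Redis unreachable:', e)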

The proxy getter process seems to have died. What's going on?

During a run, the proxy-crawling process appears to have died, and I don't know why.
The tester and API processes keep running, but the crawler process never moves, and the number of proxies in the Redis queue keeps shrinking. Does anyone know what the problem is?

Optimized the IP-crawling code: replaced all the regexes with pyquery extraction

import json
import re
from .utils import get_page
from pyquery import PyQuery as pq


class ProxyMetaclass(type):
    def __new__(cls, name, bases, attrs):
        count = 0
        attrs['CrawlFunc'] = []
        for k, v in attrs.items():
            if 'crawl_' in k:
                attrs['CrawlFunc'].append(k)
                count += 1
        attrs['CrawlFuncCount'] = count
        return type.__new__(cls, name, bases, attrs)


class Crawler(object, metaclass=ProxyMetaclass):
    def get_proxies(self, callback):
        proxies = []
        for proxy in eval(f"self.{callback}()"):
            print('Got proxy', proxy)
            proxies.append(proxy)
        return proxies

    def crawl_daili66(self, page_count=4):
        """
        Crawl daili66.
        :param page_count: number of pages
        :return: proxies
        """
        start_url = 'http://www.66ip.cn/{}.html'
        urls = [start_url.format(page) for page in range(1, page_count + 1)]
        for url in urls:
            print('Crawling', url)
            html = get_page(url)
            if html:
                doc = pq(html)
                trs = doc('.containerbox table tr:gt(0)').items()  # skip tr 0, the header row has no IP or port
                for tr in trs:
                    ip = tr.find('td:nth-child(1)').text()
                    port = tr.find('td:nth-child(2)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_ip3366(self):
        for i in range(1, 4):
            start_url = 'http://www.ip3366.net/?stype=1&page={}'.format(i)
            html = get_page(start_url)
            if html:
                doc = pq(html)
                trs = doc('#container #list table tbody tr').items()
                for tr in trs:
                    ip = tr.find('td:nth-child(1)').text()
                    port = tr.find('td:nth-child(2)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_kuaidaili(self):
        for i in range(1, 4):
            start_url = 'http://www.kuaidaili.com/free/inha/{}/'.format(i)
            html = get_page(start_url)
            if html:
                doc = pq(html)
                trs = doc('#content .con-body #list table tbody tr').items()
                for tr in trs:
                    ip = tr.find('td:nth-child(1)').text()
                    port = tr.find('td:nth-child(2)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_iphai(self):
        start_url = 'http://www.iphai.com/'
        html = get_page(start_url)
        # print(html)
        if html:
            doc = pq(html)
            trs = doc('.container .table tr:gt(0)').items()
            for tr in trs:
                ip = tr.find('td:nth-child(1)').text()
                port = tr.find('td:nth-child(2)').text()
                yield ':'.join([ip.strip(), port.strip()])

    def crawl_xicidaili(self):
        for i in range(1, 3):
            start_url = 'http://www.xicidaili.com/nn/{}'.format(i)
            html = get_page(start_url)
            if html:
                doc = pq(html)
                trs = doc('#wrapper #body table tr:gt(0)').items()
                for tr in trs:
                    ip = tr.find('td:nth-child(2)').text()
                    port = tr.find('td:nth-child(3)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_data5u(self):
        start_url = 'http://www.data5u.com/'
        html = get_page(start_url)
        if html:
            doc = pq(html)
            uls = doc('.wlist>ul ul:gt(0)').items()
            for ul in uls:
                ip = ul.find('span:nth-child(1)').text()
                port = ul.find('span:nth-child(2)').text()
                yield ':'.join([ip.strip(), port.strip()])

Swap this in yourself and it works; I tested it myself. As of 2019-10-10.
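
For context, a minimal usage sketch (assuming the class above lives in the project's crawler module): ProxyMetaclass collects every crawl_* method name into CrawlFunc, so a getter can iterate all crawlers without hard-coding their names.

    # Hypothetical driver for the Crawler above; CrawlFunc is the list of
    # crawl_* method names gathered by ProxyMetaclass.
    crawler = Crawler()
    for callback in crawler.CrawlFunc:
        for proxy in crawler.get_proxies(callback):
            print(proxy)  # e.g. '127.0.0.1:8080'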

How do I solve AttributeError: type object 'URL' has no attribute 'build'?

File "run.py", line 1, in
from proxypool.scheduler import Scheduler
File "C:\ProxyPool-master\proxypool\scheduler.py", line 4, in
from proxypool.getter import Getter
File "C:\ProxyPool-master\proxypool\getter.py", line 1, in
from proxypool.tester import Tester
File "C:\ProxyPool-master\proxypool\tester.py", line 2, in
import aiohttp
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp_init_.py", line 6, in
from .client import (
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\client.py", line 32, in
from . import hdrs, http, payload
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\http.py", line 7, in
from .http_parser import (
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\http_parser.py", line 755, in
from ._http_parser import (HttpRequestParser, # type: ignore # noqa

File "aiohttp_http_parser.pyx", line 44, in init aiohttp._http_parser
AttributeError: type object 'URL' has no attribute 'build'
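
This error generally points at an outdated yarl package: aiohttp's C parser builds URLs via yarl's URL.build classmethod, which only exists in newer yarl releases. Upgrading yarl (or reinstalling aiohttp so pip pulls a compatible yarl) should resolve it; check the exact version pairing against aiohttp's requirements.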

When I visit localhost:5555/random, the proxy never changes. Refreshing repeatedly only ever returns the same initial proxy. What could the problem be?

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Environments (please complete the following information):

  • OS: [e.g. macOS 10.15.2]
  • Python [e.g. Python 3.6]
  • Browser [e.g. Chrome 67 ]

Additional context
Add any other context about the problem here.
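
One possible explanation, hedged: /random draws from the proxies that have reached the maximum score, so if only one proxy has passed testing so far, every request returns it. A sketch of that selection logic (REDIS_KEY and MAX_SCORE mirror the project's names; this is an illustration, not the shipped code):

    import random

    import redis

    def random_proxy(db, redis_key='proxies', max_score=100):
        # Take a random member from the top-score bucket so repeated
        # requests rotate across all validated proxies.
        result = db.zrangebyscore(redis_key, max_score, max_score)
        if result:
            return random.choice(result)
        raise RuntimeError('no proxy has reached the maximum score yet')

    db = redis.StrictRedis(decode_responses=True)
    print(random_proxy(db))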

About redis-py versions

zadd and zincrby changed between redis-py 2.x and 3.x:
in 3.x, zadd takes a dict mapping element names to scores,
and zincrby's amount and value parameters swapped places.
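
A side-by-side sketch of the call change (REDIS_KEY, proxy, and score stand in for the project's values; assumes a local Redis):

    import redis

    db = redis.StrictRedis(host='localhost', port=6379, decode_responses=True)
    REDIS_KEY, proxy, score = 'proxies', '127.0.0.1:8080', 100

    # redis-py 2.x style, as the project originally called it:
    #     db.zadd(REDIS_KEY, score, proxy)
    #     db.zincrby(REDIS_KEY, proxy, -1)

    # redis-py 3.x equivalents:
    db.zadd(REDIS_KEY, {proxy: score})  # mapping of member -> score
    db.zincrby(REDIS_KEY, -1, proxy)    # amount first, then member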

How to debug this project in PyCharm

I'm using a remote interpreter and want to debug the project in PyCharm, but every time I debug run.py it reports that the file cannot be found. How can I debug this project with PyCharm?
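
A likely cause with a remote interpreter, offered as a guess: PyCharm needs deployment path mappings from the local project directory to a directory on the remote machine, or the remote side simply has no run.py to execute. Check the deployment mappings in Settings, make sure the project has been uploaded, and point the Run/Debug configuration's script path and working directory at the mapped project root.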

what/

D:\Pycharm工作资料\代码流\venv\Scripts\python.exe C:/Users/ThinkPad/Downloads/ProxyPool-master/run.py
Proxy pool started running

  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: Do not use the development server in a production environment.
    Use a production WSGI server instead.
  • Debug mode: off
  • Running on http://0.0.0.0:5555/ (Press CTRL+C to quit)
    Process Process-2:
    Starting to crawl proxies
    Getter started running
    Crawling http://www.66ip.cn/1.html
    Fetching http://www.66ip.cn/1.html
    Fetched http://www.66ip.cn/1.html 200
    Got proxy 177.185.148.46:58623
    Got proxy 131.196.143.11:7
    Got proxy 131.196.143.117:33729
    Got proxy 43.243.141.126:53281
    Got proxy 111.181.35.219:9999
    Crawling http://www.66ip.cn/2.html
    Fetching http://www.66ip.cn/2.html
    Fetched http://www.66ip.cn/2.html 200
    Got proxy 170.0.112.226:50359
    Got proxy 54.39.144.247:8080
    Got proxy 171.41.82.36:9999
    Got proxy 37.32.126.0:8080
    Got proxy 213.33.224.82:8080
    Got proxy 144.123.71.133:9999
    Got proxy 111.177.166.59:9999
    Got proxy 117.196.237.40:59250
    Got proxy 121.61.3.110:9999
    Got proxy 212.200.126.14:8080
    Got proxy 47.107.245.9:4
    Got proxy 47.107.245.94:3128
    Crawling http://www.66ip.cn/3.html
    Fetching http://www.66ip.cn/3.html
    Fetched http://www.66ip.cn/3.html 200
    Got proxy 111.177.162.175:9999
    Got proxy 110.52.235.60:9999
    Got proxy 37.224.19.1:0
    Got proxy 175.100.185.151:53281
    Got proxy 37.224.19.10:6
    Got proxy 179.127.249.5:3
    Got proxy 37.224.19.106:58553
    Got proxy 179.127.249.53:46257
    Got proxy 111.177.183.4:5
    Got proxy 1.20.101.221:55707
    Got proxy 111.177.183.45:9999
    Got proxy 91.219.171.8:4
    Crawling http://www.66ip.cn/4.html
    Fetching http://www.66ip.cn/4.html
    Fetched http://www.66ip.cn/4.html 200
    Got proxy 91.219.171.84:43726
    Got proxy 212.26.247.178:38418
    Got proxy 203.42.227.1:1
    Got proxy 203.42.227.11:3
    Got proxy 203.42.227.113:8080
    Got proxy 110.52.235.126:9999
    Got proxy 170.239.224.58:8080
    Got proxy 190.119.199.18:57333
    Got proxy 5.0.0.815:0
    Got proxy 190.152.182.150:53281
    Got proxy 119.40.98.84:46119
    Got proxy 111.177.170.220:9999
    Traceback (most recent call last):
    File "D:\python\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
    File "D:\python\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    File "C:\Users\ThinkPad\Downloads\ProxyPool-master\proxypool\scheduler.py", line 28, in schedule_getter
    getter.run()
    File "C:\Users\ThinkPad\Downloads\ProxyPool-master\proxypool\getter.py", line 30, in run
    self.redis.add(proxy)
    File "C:\Users\ThinkPad\Downloads\ProxyPool-master\proxypool\db.py", line 30, in add
    return self.db.zadd(REDIS_KEY, score, proxy)
    File "D:\Pycharm工作资料\代码流\venv\lib\site-packages\redis\client.py", line 2320, in zadd
    for pair in iteritems(mapping):
    File "D:\Pycharm工作资料\代码流\venv\lib\site-packages\redis_compat.py", line 122, in iteritems
    return iter(x.items())
    AttributeError: 'int' object has no attribute 'items'

It runs fine on Windows, but errors out on both macOS and Linux. I've searched online for half a day and I'm still completely lost. Hoping an expert can explain.
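
The traceback ends inside redis-py 3.x's zadd, which iterates its mapping argument; the project's db.py still passes score and proxy positionally, so the int lands where the dict is expected. A hedged sketch of the db.py fix (REDIS_KEY, INITIAL_SCORE, and the shape of add() follow the project):

    def add(self, proxy, score=INITIAL_SCORE):
        # redis-py 3.x zadd takes a {member: score} mapping
        if not self.db.zscore(REDIS_KEY, proxy):
            return self.db.zadd(REDIS_KEY, {proxy: score})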

Proxy pool started running

Hitting an HTTPS error on a MacBook

➜  ProxyPool git:(master) pip3 install -r requirements.txt
pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Collecting aiohttp>=1.3.3 (from -r requirements.txt (line 1))
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Could not fetch URL https://pypi.org/simple/aiohttp/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/aiohttp/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping
  Could not find a version that satisfies the requirement aiohttp>=1.3.3 (from -r requirements.txt (line 1)) (from versions: )
No matching distribution found for aiohttp>=1.3.3 (from -r requirements.txt (line 1))
pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping
➜  ProxyPool git:(master)
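
This usually means the Python interpreter itself was built without the _ssl module, so pip cannot reach PyPI over HTTPS at all. Rebuilding or reinstalling Python with OpenSSL available (for example via the python.org installer, Homebrew, or pyenv after installing openssl) typically fixes it.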

Is there a problem with how the Redis connection pool is managed?

https://stackoverflow.com/questions/31663288/how-do-i-properly-use-connection-pools-in-redis
I'm thinking that instead of creating a new connection for every Redis request, we could write something like:

import redis

# HOST, PORT, PASSWORD come from the project's settings
redis_pool = None

class RedisClient(object):
    def __init__(self, host=HOST, port=PORT):
        global redis_pool

        if not redis_pool:
            if PASSWORD:
                redis_pool = redis.Redis(host=host, port=port, password=PASSWORD)
            else:
                redis_pool = redis.Redis(host=host, port=port)
        self._db = redis_pool
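
Worth noting that redis.Redis already manages an internal connection pool per instance; the pattern the linked answer recommends is a single module-level redis.ConnectionPool shared by every client. A hedged sketch (host/port/password stand in for the project's settings):

    import redis

    # One shared pool; each command checks a connection out and
    # returns it when done, so clients stay cheap to construct.
    POOL = redis.ConnectionPool(host='localhost', port=6379,
                                password=None, decode_responses=True)

    class RedisClient(object):
        def __init__(self):
            self._db = redis.Redis(connection_pool=POOL)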

I took the author's code and ran it as-is. It worked at first, but errored out partway through crawling. This is hard; it all feels so complicated.

Proxy pool started running

  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: Do not use the development server in a production environment.
    Use a production WSGI server instead.
  • Debug mode: off
  • Running on http://0.0.0.0:5555/ (Press CTRL+C to quit)
    Starting to crawl proxies
    Getter started running
    Crawling http://www.66ip.cn/1.html
    Fetching http://www.66ip.cn/1.html
    Fetched http://www.66ip.cn/1.html 521
    Crawling http://www.66ip.cn/2.html
    Fetching http://www.66ip.cn/2.html
    Fetched http://www.66ip.cn/2.html 521
    Crawling http://www.66ip.cn/3.html
    Fetching http://www.66ip.cn/3.html
    Fetched http://www.66ip.cn/3.html 521
    Crawling http://www.66ip.cn/4.html
    Fetching http://www.66ip.cn/4.html
    Fetched http://www.66ip.cn/4.html 521
    Fetching http://www.ip3366.net/?stype=1&page=1
    Fetched http://www.ip3366.net/?stype=1&page=1 200
    Got proxy 222.135.25.243:8060
    Got proxy 180.175.8.5:8060
    Got proxy 119.180.131.25:8060
    Got proxy 180.175.160.130:8060
    Got proxy 119.180.177.138:8060
    Got proxy 119.180.1.42:8060
    Got proxy 171.112.165.22:9999
    Got proxy 222.182.121.71:8118
    Got proxy 118.81.68.2:80
    Got proxy 117.166.3.51:8118
    Fetching http://www.ip3366.net/?stype=1&page=2
    Fetched http://www.ip3366.net/?stype=1&page=2 200
    Got proxy 171.83.164.51:9999
    Got proxy 47.101.189.13:80
    Got proxy 171.112.164.149:9999
    Got proxy 171.112.164.109:9999
    Got proxy 119.97.237.74:80
    Got proxy 197.234.42.73:8083
    Got proxy 103.120.152.182:59068
    Got proxy 117.168.86.102:8118
    Got proxy 115.215.212.116:8118
    Got proxy 103.244.91.61:8080
    Fetching http://www.ip3366.net/?stype=1&page=3
    Fetched http://www.ip3366.net/?stype=1&page=3 200
    Got proxy 117.80.137.238:9999
    Got proxy 103.233.145.133:8080
    Got proxy 117.80.17.81:8118
    Got proxy 171.83.165.10:9999
    Got proxy 43.248.123.237:8080
    Got proxy 113.227.182.15:8118
    Got proxy 138.97.219.51:65301
    Got proxy 117.41.142.159:8118
    Got proxy 197.234.42.209:8083
    Got proxy 197.234.44.125:8083
    Process Process-2:
    Traceback (most recent call last):
    File "D:\Python\lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
    File "D:\Python\lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
    File "D:\spider-test\ProxyPool-master\proxypool\scheduler.py", line 28, in schedule_getter
    getter.run()
    File "D:\spider-test\ProxyPool-master\proxypool\getter.py", line 30, in run
    self.redis.add(proxy)
    File "D:\spider-test\ProxyPool-master\proxypool\db.py", line 30, in add
    return self.db.zadd(REDIS_KEY, score, proxy)
    File "D:\Python\lib\site-packages\redis\client.py", line 2320, in zadd
    for pair in iteritems(mapping):
    File "D:\Python\lib\site-packages\redis_compat.py", line 109, in iteritems
    return iter(x.items())
    AttributeError: 'int' object has no attribute 'items'

The sorted-set zadd method changed in redis

def add(self, proxy, score=INITIAL_SCORE):
    """
    Add a proxy and set its score to the initial maximum.
    :param proxy: proxy
    :param score: score
    :return: result of the add
    """
    if not re.match(r'\d+\.\d+\.\d+\.\d+:\d+', proxy):
        print('Proxy is malformed', proxy, 'discarding')
        return
    if not self.db.zscore(REDIS_KEY, proxy):
        return self.db.zadd(REDIS_KEY, {proxy: score})
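
The same redis-py 3.x change also touches RedisClient.max (same {proxy: score} mapping) and RedisClient.decrease, where zincrby's arguments swapped. A hedged sketch of the decrease counterpart (REDIS_KEY and MIN_SCORE follow the project):

    def decrease(self, proxy):
        # Lower a proxy's score by one; drop it once it is no longer
        # above MIN_SCORE.
        score = self.db.zscore(REDIS_KEY, proxy)
        if score and score > MIN_SCORE:
            # redis-py 3.x: amount comes before the member
            return self.db.zincrby(REDIS_KEY, -1, proxy)
        return self.db.zrem(REDIS_KEY, proxy)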

The crawl functions could also read proxies straight from a file

Here's a rough version to drop into the Crawler class; each line of the file uses the "address:port" format.

def crawl_file(self):
    filename = 'proxy.txt'  # the txt file sits next to this script, so no full path needed
    with open(filename, 'r') as file_to_read:
        while True:
            line = file_to_read.readline()  # read one whole line
            if not line:
                break
            yield line.strip()  # drop the trailing newline so the proxy validates

Error when writing

Running it raises an error:

/proxypool/db.py", line 30, in add

return iter(x.items())
AttributeError: 'int' object has no attribute 'items'

The API has no get endpoint

Not sure whether the author forgot or what, but it should be changed to random; otherwise you can't fetch a proxy at all.
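
A hedged sketch of the endpoint in question, mirroring what the project's api.py appears to do (treat the wiring as illustration):

    from flask import Flask
    from proxypool.db import RedisClient   # the project's Redis wrapper

    app = Flask(__name__)

    @app.route('/random')
    def get_proxy():
        # Return one random high-score proxy as plain text.
        conn = RedisClient()
        return conn.random()

    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=5555)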

How do I solve this problem?

Ip processing running

  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: Do not use the development server in a production environment.
    Use a production WSGI server instead.
  • Debug mode: off
  • Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
    Refreshing ip
    PoolAdder is working
    Waiting for adding
    Callback crawl_ip181
    Error occurred during loading data. Trying to use cache server http://d2g6u4gh6d9rq0.cloudfront.net/browsers/fake_useragent_0.1.10.json
    Traceback (most recent call last):
    File "C:\Python\Python36\lib\urllib\request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
    File "C:\Python\Python36\lib\http\client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
    File "C:\Python\Python36\lib\http\client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
    File "C:\Python\Python36\lib\http\client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
    File "C:\Python\Python36\lib\http\client.py", line 1026, in _send_output
    self.send(msg)
    File "C:\Python\Python36\lib\http\client.py", line 964, in send
    self.connect()
    File "C:\Python\Python36\lib\http\client.py", line 1392, in connect
    super().connect()
    File "C:\Python\Python36\lib\http\client.py", line 936, in connect
    (self.host,self.port), self.timeout, self.source_address)
    File "C:\Python\Python36\lib\socket.py", line 722, in create_connection
    raise err
    File "C:\Python\Python36\lib\socket.py", line 713, in create_connection
    sock.connect(sa)
    socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 67, in get
context=context,
File "C:\Python\Python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Python\Python36\lib\urllib\request.py", line 526, in open
response = self._open(req, data)
File "C:\Python\Python36\lib\urllib\request.py", line 544, in _open
'_open', req)
File "C:\Python\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Python\Python36\lib\urllib\request.py", line 1361, in https_open
context=self._context, check_hostname=self._check_hostname)
File "C:\Python\Python36\lib\urllib\request.py", line 1320, in do_open
raise URLError(err)
urllib.error.URLError:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 154, in load
for item in get_browsers(verify_ssl=verify_ssl):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 97, in get_browsers
html = get(settings.BROWSERS_STATS_PAGE, verify_ssl=verify_ssl)
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 84, in get
raise FakeUserAgentError('Maximum amount of retries reached')
fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached
Process Process-2:
Traceback (most recent call last):
File "C:\Python\Python36\lib\multiprocessing\process.py", line 249, in _bootstrap
self.run()
File "C:\Python\Python36\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "C:\迅雷下载\ProxyPool-master\proxypool\schedule.py", line 130, in check_pool
adder.add_to_queue()
File "C:\迅雷下载\ProxyPool-master\proxypool\schedule.py", line 87, in add_to_queue
raw_proxies = self._crawler.get_raw_proxies(callback)
File "C:\迅雷下载\ProxyPool-master\proxypool\getter.py", line 28, in get_raw_proxies
for proxy in eval("self.{}()".format(callback)):
File "C:\迅雷下载\ProxyPool-master\proxypool\getter.py", line 35, in crawl_ip181
html = get_page(start_url)
File "C:\迅雷下载\ProxyPool-master\proxypool\utils.py", line 14, in get_page
'User-Agent': ua.random,
UnboundLocalError: local variable 'ua' referenced before assignment
Refreshing ip
Waiting for adding
Refreshing ip
Waiting for adding
Refreshing ip
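
The root failure above is fake_useragent timing out while downloading its data file; get_page in utils.py then references ua before it was ever assigned. A hedged rework of get_page with a fallback User-Agent (the original function's shape is approximated from the traceback):

    import requests
    from fake_useragent import UserAgent
    from fake_useragent.errors import FakeUserAgentError

    def get_page(url, options={}):
        try:
            user_agent = UserAgent().random
        except FakeUserAgentError:
            # Fall back to a fixed UA when the fake_useragent data file
            # cannot be fetched (hypothetical fallback string).
            user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
        headers = dict({'User-Agent': user_agent}, **options)
        print('Fetching', url)
        try:
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                return response.text
        except requests.ConnectionError:
            return None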

The console_scripts entry in setup.py

I found that with run:cli in console_scripts, the installed command never works. Why is it written that way? After I renamed run to pool_run and changed the script entry to pool_run:main, it worked fine.
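
For reference, a hedged sketch of the entry-points wiring that matches the commenter's fix (an entry point must name an importable module and a callable inside it, so run:cli fails if run.py defines no cli):

    from setuptools import setup, find_packages

    setup(
        name='proxypool',
        version='1.0',
        packages=find_packages(),
        py_modules=['pool_run'],       # top-level module holding main()
        entry_points={
            'console_scripts': [
                # installed command = module:callable
                'proxypool = pool_run:main',
            ],
        },
    )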

About the db.py problem in the traceback

Previous issues all say to change the zadd calls in db.py's add and max functions to take a dict (zadd(REDIS_KEY, {proxy: score})) and to swap the zincrby arguments in decrease (zincrby(REDIS_KEY, -1, proxy)), but after I made those changes things broke instead.
I went and read the redis-py documentation, and unless there has been another revision, what the author originally wrote is exactly what the docs call for.
The documentation says:
[screenshot of the redis-py documentation]
[screenshot of the redis-py documentation]

So those two places don't actually need changing.
One spot does look wrong, though:
[screenshot of db.py]
The first boxed value should be MIN_SCORE, right?

zadd changed in newer redis versions

In the new version, zadd changed: the call needs to become zadd(REDIS_KEY, {proxy: score}).
There are two places, in RedisClient.add() and RedisClient.max().
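
A hedged sketch of the second spot, RedisClient.max (shape follows the project's db.py; MAX_SCORE is the pool's top score):

    def max(self, proxy):
        # A proxy that passes testing gets pinned to the top score.
        print('Proxy', proxy, 'is usable, setting score to', MAX_SCORE)
        # redis-py 3.x: pass a {member: score} mapping
        return self.db.zadd(REDIS_KEY, {proxy: MAX_SCORE})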

The program fails to run

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.
[screenshot]

Environments (please complete the following information):

  • OS: [win10]
  • Python [Python 3.7]

Additional context
Add any other context about the problem here.

Suggestions for the setting.py configuration in the proxy pool project

Not a bug, just suggestions:
1. In the project's setting.py, the LOG_DIR log-directory parameter is declared but never used.
A ...\project\ProxyPool\logs folder should be created, and the configuration changed from:
logger.add(env.str('LOG_RUNTIME_FILE', 'runtime.log'), level='DEBUG', rotation='1 week', retention='20 days')
logger.add(env.str('LOG_ERROR_FILE', 'error.log'), level='ERROR', rotation='1 week')

to:
logger.add(env.str('LOG_RUNTIME_FILE', f'{LOG_DIR}/runtime.log'), level='DEBUG', rotation='1 week', retention='20 days')
logger.add(env.str('LOG_ERROR_FILE', f'{LOG_DIR}/error.log'), level='ERROR', rotation='1 week')

2. If the ENABLE_TESTER, ENABLE_GETTER, and ENABLE_SERVER switches in setting.py are all False, running run.py raises an error (and the finally clause of the try block errors as well); scheduler.py could be adjusted to handle this. (A bit nitpicky; feel free to ignore.)
[screenshot of the switch parameters]
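
A hedged sketch of suggestion 1 put together (loguru and environs usage follow the issue; deriving LOG_DIR next to setting.py is an assumption):

    from os.path import abspath, dirname, join

    from environs import Env
    from loguru import logger

    env = Env()
    ROOT_DIR = dirname(dirname(abspath(__file__)))  # assumed project root
    LOG_DIR = join(ROOT_DIR, 'logs')

    logger.add(env.str('LOG_RUNTIME_FILE', join(LOG_DIR, 'runtime.log')),
               level='DEBUG', rotation='1 week', retention='20 days')
    logger.add(env.str('LOG_ERROR_FILE', join(LOG_DIR, 'error.log')),
               level='ERROR', rotation='1 week')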

Optimize on-demand use of paid proxies

Free proxies have a low usability rate, so ideally the pool would mix paid and free IPs.
That raises a problem: the tester keeps exercising the paid IPs, so we get charged even when no actual crawling job is using a proxy.

It would be good to optimize the paid-proxy mechanism: only pull paid proxies when a crawler actually needs them, with a configurable number of IPs fetched from the provider per pull.

Error popped up, asking for help: attributes() got an unexpected keyword argument 'frozen'

Traceback (most recent call last):
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\run.py", line 1, in <module>
from proxypool.scheduler import Scheduler
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\scheduler.py", line 4, in <module>
from proxypool.getter import Getter
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\getter.py", line 1, in <module>
from proxypool.tester import Tester
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\tester.py", line 2, in <module>
import aiohttp
File "D:\Anaconda3\lib\site-packages\aiohttp\__init__.py", line 6, in <module>
from .client import *  # noqa
File "D:\Anaconda3\lib\site-packages\aiohttp\client.py", line 16, in <module>
from . import client_exceptions, client_reqrep
File "D:\Anaconda3\lib\site-packages\aiohttp\client_reqrep.py", line 18, in <module>
from . import hdrs, helpers, http, multipart, payload
File "D:\Anaconda3\lib\site-packages\aiohttp\helpers.py", line 161, in <module>
@attr.s(frozen=True, slots=True)
TypeError: attributes() got an unexpected keyword argument 'frozen'
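
This failure points at an outdated attrs package: aiohttp's helpers decorate classes with @attr.s(frozen=True, slots=True), and the installed attrs is old enough that its attributes() decorator predates the frozen argument. Upgrading attrs (pip install --upgrade attrs), or reinstalling aiohttp so a compatible attrs is pulled in, should clear the error.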
