proxyscrape's Introduction

Proxy Scrape

A library for retrieving free proxies (HTTP, HTTPS, SOCKS4, SOCKS5). Supports Python 2.7+ and 3.4+.

NOTE: This library isn't designed for production use. It's advised to use your own proxies or purchase a service which provides an API. These are merely free ones that are retrieved from sites and should only be used for development or testing purposes.

import proxyscrape

collector = proxyscrape.create_collector('default', 'http')  # Create a collector for http resources
proxy = collector.get_proxy({'country': 'united states'})  # Retrieve a united states proxy

Installation

The latest version of proxyscrape is available via pip:

$ pip install proxyscrape

Alternatively, you can download and install from source:

$ python setup.py install

Provided Proxies

The proxies currently provided are scraped from various sites that offer free HTTP, HTTPS, SOCKS4, and SOCKS5 proxies and that don't require headless browsers or Selenium to retrieve. The sites proxies are retrieved from are listed below.

resource            resource type   url
anonymous-proxy     http, https     https://free-proxy-list.net/anonymous-proxy.html
free-proxy-list     http, https     https://free-proxy-list.net
proxy-daily-http    http            http://www.proxy-daily.com
proxy-daily-socks4  socks4          http://www.proxy-daily.com
proxy-daily-socks5  socks5          http://www.proxy-daily.com
socks-proxy         socks4, socks5  https://www.socks-proxy.net
ssl-proxy           https           https://www.sslproxies.org
uk-proxy            http, https     https://free-proxy-list.net/uk-proxy.html
us-proxy            http, https     https://www.us-proxy.org

See the Integration section for additional proxies.

Getting Started

Proxy Scrape is a library aimed at providing an efficient and easy means of retrieving proxies for web-scraping purposes. The proxies are retrieved from sites that provide them for free. As shown in the table above, each proxy is of one of the following types (referred to as a resource type): http, https, socks4, or socks5.

Collectors

Collectors serve as the interface for retrieving proxies. They are instantiated at module level and can be retrieved and re-used in different parts of the application (similar to loggers in the Python logging library). Collectors can be created and retrieved via the create_collector(...) and get_collector(...) functions.

from proxyscrape import create_collector, get_collector

collector = create_collector('my-collector', ['socks4', 'socks5'])

# Some other section of code
collector = get_collector('my-collector')

Each collector should have a unique name and be initialized only once. Typically, only a single collector of a given resource type should be utilized; filters can then be applied to the proxies when more specific criteria are desired.

When given one or more resources, the collector will use those to retrieve proxies. When given one or more resource types instead, the resources registered for each of those types will be used.
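
For example, a collector can be built from specific resources rather than a whole resource type. This sketch assumes a resources keyword parameter on create_collector; verify the signature against your installed version.

from proxyscrape import create_collector

# A collector drawing only from the 'us-proxy' and 'uk-proxy' resources
collector = create_collector('us-uk-collector', None, resources=['us-proxy', 'uk-proxy'])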

Once a collector is created, proxies can be retrieved via the get_proxy(...) or get_proxies(...) functions. These optionally take a filter_opts parameter which can filter by the following:

  • code (us, ca, ...)
  • country (united states, canada, ...)
  • anonymous (True, False)
  • type (http, https, socks4, socks5, ...)

from proxyscrape import create_collector

collector = create_collector('my-collector', 'http')

# Retrieve any http proxy
proxy = collector.get_proxy()

# Retrieve only 'us' proxies
proxy = collector.get_proxy({'code': 'us'})

# Retrieve only anonymous 'uk' or 'us' proxies
proxy = collector.get_proxy({'code': ('us', 'uk'), 'anonymous': True})

# Retrieve all 'ca' proxies
proxies = collector.get_proxies({'code': 'ca'})

Filters can be applied to every proxy retrieval from the collector via apply_filter(...). This is useful when the same filter is expected for any proxy retrieved.

from proxyscrape import create_collector

collector = create_collector('my-collector', 'http')

# Only retrieve 'us' proxies
collector.apply_filter({'code': 'us'})

# Filtered proxies
proxy = collector.get_proxy()

# Clear filter
collector.clear_filter()

Note that the same results can sometimes be achieved with specific resources instead of filters (e.g. the 'us-proxy' or 'uk-proxy' resources for 'us' and 'uk' proxies).

Blacklists can be applied to a collector to prevent specific proxies from being retrieved. They accept either one or more Proxy objects, or a host and port combination, and won't allow retrieval of matching proxies. Proxies can be individually removed from the blacklist, or the entire blacklist can be cleared.

from proxyscrape import create_collector
from proxyscrape.shared import Proxy  # Proxy namedtuple: (host, port, code, country, anonymous, type, source)

collector = create_collector('my-collector', 'http')

# Add proxy to blacklist
collector.blacklist_proxy(Proxy('192.168.1.1', '80', None, None, None, 'http', 'my-resource'))
collector.blacklist_proxy(host='192.168.1.2', port='8080')

# Blacklisted proxies won't be included
proxy = collector.get_proxy()

# Remove individual proxies
collector.remove_blacklist(host='192.168.1.1', port='80')

# Clear blacklist
collector.clear_blacklist()

Instead of permanently blacklisting a particular proxy, it can instead be removed from the collector's internal memory. This allows it to be re-added to the pool upon a subsequent refresh.

from proxyscrape import create_collector
from proxyscrape.shared import Proxy

collector = create_collector('my-collector', 'http')

# Remove proxy from internal pool
collector.remove_proxy(Proxy('192.168.1.1', '80', None, None, None, 'http', 'my-resource'))

Apart from the automatic refreshes that occur when retrieving proxies, proxies can also be forcefully refreshed via the refresh_proxies(...) function.

from proxyscrape import create_collector

collector = create_collector('my-collector', 'http')

# Forcefully refresh
collector.refresh_proxies(force=True)

# Refresh only if proxies not refreshed within `refresh_interval`
collector.refresh_proxies(force=False)
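
The refresh interval itself can be set when creating the collector. A minimal sketch, assuming refresh_interval is a keyword parameter of create_collector measured in seconds:

from proxyscrape import create_collector

# Proxies are considered stale after 15 minutes
collector = create_collector('my-collector', 'http', refresh_interval=900)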

Resources

Resources refer to specific functions that retrieve a set of proxies; the currently implemented resources all retrieve proxies by scraping a particular web site.

Additional user-defined resources can be added to the pool of proxy retrieval functions via the add_resource(...) function. Resources can belong to multiple resource types.

from proxyscrape import add_resource
from proxyscrape.shared import Proxy

def func():
    return {Proxy('192.168.1.1', '80', 'us', 'united states', False, 'http', 'my-resource'), }

add_resource('my-resource', func, 'http')

As shown above, a resource doesn't necessarily have to scrape proxies from a web site. It can return a hard-coded list of proxies, make a call to an API, read from a file, etc.
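
For instance, a minimal sketch of a file-backed resource (the file name, its format, and the proxyscrape.shared import location of Proxy are assumptions):

from proxyscrape import add_resource
from proxyscrape.shared import Proxy  # assumed import location of the Proxy namedtuple

def file_resource():
    # Read one 'host:port' entry per line from a hypothetical proxies.txt
    proxies = set()
    with open('proxies.txt') as f:
        for line in f:
            host, port = line.strip().split(':')
            proxies.add(Proxy(host, port, None, None, None, 'http', 'file-resource'))
    return proxies

add_resource('file-resource', file_resource, 'http')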

The set of library- and user-defined resources can be retrieved via the get_resources(...) function.

from proxyscrape import get_resources
resources = get_resources()

Resource Types

Resource types are groupings of resources that can be specified when defining a collector (as opposed to giving a collection of individual resources).

Additional user-defined resource types can be added via the add_resource_type(...) function. Resources can optionally be added to a resource type when defining it.

from proxyscrape import add_resource_type
add_resource_type('my-resource-type')
add_resource_type('my-other-resource-type', 'my-resource')  # Define resources for resource type

The set of library- and user-defined resource types can be retrieved via the get_resource_types(...) function.

from proxyscrape import get_resource_types
resource_types = get_resource_types()

Integration

Integrations are proxy implementations that are specific to a particular website or API and have a distinctly separate use case.

ProxyScrape

The ProxyScrape.com API provides a means of accessing thousands of proxies of various types (HTTP, SOCKS4, SOCKS5) in an efficient manner. These are vetted and validated with a minimal response time.

The get_proxyscrape_resource(...) function is used to dynamically create a new resource for using the proxyscrape API. The resource name can then be added to a resource type and used like any other library- or user-defined resource. The following parameters are used for the API:

  • proxytype (http, socks4, socks5, all)
  • timeout (milliseconds > 0)
  • ssl (yes, no, all)
  • anonymity (elite, anonymous, transparent, all)
  • country (any Alpha 2 ISO country code, all)

from proxyscrape import get_proxyscrape_resource
resource_name = get_proxyscrape_resource(proxytype='http', timeout=5000, ssl='yes', anonymity='all', country='us')
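
The returned resource name can then be grouped under its own resource type and handed to a collector, as with any other resource. A sketch (but note the type-filtering caveat described in the issue "Proxies are filtered out when using a custom resource." below):

from proxyscrape import add_resource_type, create_collector, get_proxyscrape_resource

resource_name = get_proxyscrape_resource(proxytype='http', timeout=5000, ssl='yes', anonymity='all', country='us')

# Group the dynamically created resource under a dedicated resource type
add_resource_type('proxyscrape-http', resource_name)
collector = create_collector('proxyscrape-collector', 'proxyscrape-http')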

Contribution

Contributions and suggestions are welcome! Feel free to open an issue if you find a bug or would like an enhancement, or to open a pull request directly.

Changelog

All changes and versioning information can be found in the CHANGELOG.

License

Copyright (c) 2018 Jared Gillespie. See LICENSE for details.

proxyscrape's People

Contributors

colossatr0n, ferluci, jaredlgillespie, rejoiceinhope, thibeaum


proxyscrape's Issues

Optimize retrieving proxies from store

When retrieving a proxy, the following occurs:

  1. If refresh is needed, scrape new proxies from sources
  2. Filter out each proxy from blacklist
  3. Pick a random proxy and return it

If a refresh doesn't occur and the blacklist doesn't change, then steps 1 and 2 can be skipped. Also, we can use inverted indexes for filtering based on anonymity, country code, etc.; performing intersections (or unions) on the sets provided by the inverted indexes should yield the pool of applicable proxies, as sketched below. This should drastically improve performance when a large number of proxies are retrieved.
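
A minimal sketch of the inverted-index idea (hypothetical code, not part of the library; the proxyscrape.shared import location of Proxy is an assumption):

from collections import defaultdict
from proxyscrape.shared import Proxy

proxies = {
    Proxy('192.168.1.1', '80', 'us', 'united states', True, 'http', 'r1'),
    Proxy('192.168.1.2', '80', 'uk', 'united kingdom', False, 'http', 'r1'),
}

# Build one index per filterable attribute: value -> set of proxies
by_code = defaultdict(set)
by_anonymous = defaultdict(set)
for p in proxies:
    by_code[p.code].add(p)
    by_anonymous[p.anonymous].add(p)

# Filter {'code': ('us', 'uk'), 'anonymous': True}: union within a key, intersect across keys
candidates = (by_code['us'] | by_code['uk']) & by_anonymous[True]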

Also, please add methods for retrieving the number of proxies currently stored. The method should optionally take a filter and return the number of proxies matching it.

Proxies are filtered out when using a custom resource.

I may be using resources and resource types incorrectly, but I've looked through the code and think there might be an issue where custom resource types filter out the proxies retrieved from get_proxyscrape_resource.

Example:

from proxyscrape import add_resource_type, create_collector, get_proxyscrape_resource

# Create proxyscrape api resource
api_resource = get_proxyscrape_resource(proxytype='http', timeout=5000, ssl='yes', anonymity='all', country='us')

# Add api resource to new resource type
add_resource_type("api", api_resource)

# Create collector
api_collector = create_collector('api-collector', 'api')

# Get proxies
api_collector.get_proxy()

The code above will return None, since get_proxy() applies "api" as a type filter, and the type applied to each proxy is set by proxytype (which is different from the custom "api" resource type).

It seems to me that get_proxyscrape_resource should have an additional type parameter, separate from proxytype, so that it can be used with a custom resource type.

An updated usage would look like this:

# Create proxyscrape api resource and specify the type.
api_resource = get_proxyscrape_resource(type='api', proxytype='http', timeout=5000, ssl='yes', anonymity='all', country='us')

# Add api resource to new resource type. This resource will filter out any proxies that don't
# have type "api".
add_resource_type("api", api_resource)

# Create collector.
# Since all proxies instantiated by get_proxyscrape_resource will have the type "api", they 
# won't be filtered out by the new resource type which has the same type.
api_collector = create_collector('api-collector', 'api')

# Get proxies. The type "api" is used to filter out proxies that don't have the type "api".
api_collector.get_proxy()

Am I missing something here? This seems like a major functionality flaw that also leaks into other places.

Outputting way too few proxies...

from proxyscrape import create_collector
from pprint import pprint
collector = create_collector('my-collector', ['socks4', 'socks5'])
proxy = collector.get_proxies()
pprint(proxy)

Am I doing something wrong, or why does this only print 1052 SOCKS4 and SOCKS5 proxies? Isn't it supposed to be way more?

Better Blacklist

  • Allow removing individual IPs from blacklist
  • Blacklist should be tuples of IP and port (ignoring misc proxy info)

Improve Proxy Selection

Proxy selection is done with sets and pop(), which is O(n). Implement it with lists instead and provide a means of both random and non-random selection.
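
For reference, a list supports O(1) random selection, and O(1) removal when order doesn't matter (swap with the last element, then pop):

import random

proxy_list = ['1.2.3.4:80', '5.6.7.8:8080', '9.10.11.12:3128']

# O(1) random selection
proxy = random.choice(proxy_list)

# O(1) removal of a random element: swap it to the end, then pop
i = random.randrange(len(proxy_list))
proxy_list[i], proxy_list[-1] = proxy_list[-1], proxy_list[i]
removed = proxy_list.pop()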

Add option to ignore failed proxy retrievals

When multiple proxy functions are used, a user may not care that one of them fails. It may be beneficial to add an option that lets the user ignore failed retrievals instead of bubbling up an error.
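
As a user-side workaround today, a resource function can be wrapped so that failures yield an empty set instead of an error (hypothetical helper):

from proxyscrape import add_resource

def flaky_resource():
    # Stand-in for a resource whose source is sometimes down (hypothetical)
    raise ConnectionError('source unavailable')

def ignore_failures(resource_func):
    # Swallow failures so one dead source doesn't break proxy retrieval
    def wrapper():
        try:
            return resource_func()
        except Exception:
            return set()
    return wrapper

add_resource('safe-resource', ignore_failures(flaky_resource), 'http')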

proxyscrape returns None

>>> import proxyscrape
>>> collector = proxyscrape.create_collector('default', 'http')
>>> proxy = collector.get_proxy({'country': 'united states'})
>>> print(proxy)
None

Python 3.9.6
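
A likely cause is that no proxy currently in the pool matches the country filter, rather than the collector being broken. A quick way to check (sketch):

import proxyscrape

collector = proxyscrape.create_collector('default', 'http')

# If the filtered call returns None, see whether the pool has any proxies at all
print(collector.get_proxy({'country': 'united states'}))
print(len(collector.get_proxies() or []))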

Add retrying to proxy fetching

Proxy retrieval should optionally be retried. This can either be implemented on the methods directly or by adding a dependency on retry.me.

Retrying should be an option when creating the collector and apply to all proxy functions.
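
Pending built-in support, a simple wrapper over the documented API can approximate retrying (the attempt count and delay below are arbitrary):

import time

def get_proxy_with_retry(collector, filter_opts=None, attempts=3, delay=1.0):
    # Retry a few times, forcing a refresh between attempts
    for _ in range(attempts):
        proxy = collector.get_proxy(filter_opts)
        if proxy is not None:
            return proxy
        time.sleep(delay)
        collector.refresh_proxies(force=True)
    return None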

Fix intermittent failures with Python 2.7

Specific tests occasionally fail with Python 2.7. Rerunning the tests sometimes succeeds. Other Python versions seem to work properly.

Sample failed test:

======================================================================
ERROR: test_add_resource_type_multiple_resources (tests.test_scrapers.TestResource)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/JaredLGillespie/proxyscrape/tests/test_scrapers.py", line 647, in test_add_resource_type_multiple_resources
    add_resource_type('my-resource-type', ('us-proxy', 'uk-proxy'))
  File "/home/travis/build/JaredLGillespie/proxyscrape/proxyscrape/scrapers.py", line 378, in add_resource_type
    '{} is already defined as a resource type'.format(name))
ResourceTypeAlreadyDefinedError: my-resource-type is already defined as a resource type
======================================================================
FAIL: test_add_resource_type_exception_if_duplicate_lock_check (tests.test_scrapers.TestResource)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/JaredLGillespie/proxyscrape/tests/test_scrapers.py", line 635, in test_add_resource_type_exception_if_duplicate_lock_check
    add_resource_type('my-resource-type')
AssertionError: ResourceTypeAlreadyDefinedError not raised

Integrate proxyscrape.com

Add support for https://proxyscrape.com/ proxies.

Add functions for the following new resource types: http, socks4, and socks5.

Add simple integration tests

Test should make actual calls to each of the different proxy services to validate that they're still live and their content hasn't changed in a way that breaks our code.

1 test per proxy source should be sufficient. It should call the source and validate that some proxies were retrieved.
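A sketch of one such test, using a collector as the entry point (names and structure are illustrative; a strictly per-source variant would need a way to target a single resource):

import unittest
from proxyscrape import create_collector

class TestSourcesLive(unittest.TestCase):
    def test_http_sources_return_proxies(self):
        # Hits live proxy sites; exclude from offline test runs
        collector = create_collector('it-http', 'http')
        self.assertTrue(collector.get_proxies())

if __name__ == '__main__':
    unittest.main()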
