https-everywhere-py's Introduction

https-everywhere : Privacy for Pythons

This project primarily provides requests adapters for Chrome HSTS Preload and HTTPS Everywhere rulesets.

At this stage, the focus is on correct and efficient loading of the approximately 25,000 HTTPS Everywhere rulesets for use with requests. Emphasis is on converting those rulesets to simpler or more common rules to reduce memory requirements.

The current list of data problems can be found in https_everywhere/_fixme.py. Many of these have patches sent upstream to the main HTTPS Everywhere project.

Usage

from https_everywhere.session import HTTPSEverywhereSession

s = HTTPSEverywhereSession()
r = s.get("http://freerangekitten.com/")
r.raise_for_status()

assert r.url == "https://freerangekitten.com/"
assert len(r.history) == 1
assert r.history[0].status_code == 302
assert r.history[0].reason == "HTTPS Everywhere"

The log will emit

[W 200226 09:40:55 _rules:632] Rejecting rule with pattern "^http://ww2\.epeat\.com/"
[W 200226 09:40:55 _rules:640] Rejecting ruleset EPEAT (partial) as it has no usable rules
[W 200226 09:40:55 _rules:632] Rejecting rule with pattern "^http://(?:dashboard(?:-cdn)?|g-pixel|pixel|segment-pixel)\.invitemedia\.com/"
[W 200226 09:40:55 _rules:632] Rejecting rule with pattern "^http://((?:a[lt]|s|sca)\d*|www)\.listrakbi\.com/"
[W 200226 09:40:55 _rules:640] Rejecting ruleset ListrakBI.com as it has no usable rules
[W 200226 09:40:55 _rules:632] Rejecting rule with pattern "^http://demo\.neobookings\.com/"
[W 200226 09:40:55 _rules:632] Rejecting rule with pattern "^http://(www\.)?partners\.peer1\.ca/"
[W 200226 09:40:55 _rules:640] Rejecting ruleset Peer1.ca (partial) as it has no usable rules
[W 200226 09:40:55 _rules:632] Rejecting rule with pattern "^http://support\.pickaweb\.co\.uk/(assets/)"
[W 200226 09:40:55 _rules:632] Rejecting rule with pattern "^http://www\.svenskaspel\.se/"
[W 200226 09:40:55 _rules:632] Rejecting rule with pattern "^http://cdn\.therepublic\.com/"
[W 200226 09:40:55 _rules:640] Rejecting ruleset The Republic (partial) as it has no usable rules

Adapters

There are many adapters in https_everywhere.adapter which can be used depending on use cases.

Adapters can be mounted on 'http://', or a narrower mount point.

  • HTTPBlockAdapter - Mount on 'http://' to block HTTP traffic
  • HTTPRedirectBlockAdapter - Mount on 'https://' to block HTTPS responses redirecting to HTTP
  • HTTPSEverywhereOnlyAdapter - Apply HTTPS Everywhere rules
  • ChromePreloadHSTSAdapter - Upgrade to HTTPS for sites on Chrome preload list
  • MozillaPreloadHSTSAdapter - Upgrade to HTTPS for sites on Mozilla preload list
  • HTTPSEverywhereAdapter - Chrome preload HSTS and HTTPS Everywhere rules combined
  • ForceHTTPSAdapter - Just use HTTPS, always, everywhere
  • PreferHTTPSAdapter - Check HTTP if there are any redirects, before switching to HTTPS.
  • UpgradeHTTPSAdapter - Force HTTPS, but fall back to HTTP when HTTPS problems occur.
  • SafeUpgradeHTTPSAdapter - First check HTTP if there are any redirects, force HTTPS, and fallback to HTTP.
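
The remark about narrower mount points relies on how requests selects an adapter: the longest mounted prefix that matches the URL wins. The following is a standard-library-only sketch of that selection rule, not code from requests or from this project. The function pick_mount, the intranet.example host, and the strings standing in for adapter objects are all hypothetical.

```python
def pick_mount(mounts, url):
    """Return the value mounted on the longest prefix that matches url."""
    # Longest prefixes are tried first, so a narrow mount point wins
    # over a broad one such as "http://".
    for prefix in sorted(mounts, key=len, reverse=True):
        if url.lower().startswith(prefix.lower()):
            return mounts[prefix]
    return None


# Strings stand in for adapter instances in this illustration.
mounts = {
    "http://": "HTTPSEverywhereAdapter",
    "https://": "HTTPRedirectBlockAdapter",
    "http://intranet.example/": "plain HTTP adapter",
}

narrow = pick_mount(mounts, "http://intranet.example/wiki")
broad = pick_mount(mounts, "http://freerangekitten.com/")
```

With a real session the equivalent registration is session.mount(prefix, adapter), so a single trusted host can be exempted while everything else on 'http://' is upgraded or blocked.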

Testing

To run the tests:

git clone https://github.com/jayvdb/https-everywhere-py
git clone https://github.com/EFForg/https-everywhere  # possibly use --depth 1
cd https-everywhere-py
tox

(Note: test_rules takes a long time to begin.)

Not implemented

  • custom local ruleset channels
  • cookie support
  • credentials in URLs, such as http://eff:[email protected]/, which interfere with many rules, and also prevent exclusions from being applied
  • efficient memory structure for target mapping
  • rules with @platform='mixedcontent'; approx 800 rulesets ignored
  • rules with @default_off; approx 300 rulesets ignored, but all are mixedcontent
  • ruleset targets containing wildcards in the middle of the domain name (foo.*.com), which do not exist in the default channel
  • ruleset targets containing a wildcard at both the beginning and the end (*.foo.*), which do not exist in the default channel
  • overlapping rules, which only apply to voxmedia.com in the default channel when filtered to exclude rules with @default_off and @platform
  • rules for IPs; there are two, 1.0.0.1 and 1.1.1.1, in the default channel. See https://en.wikipedia.org/wiki/1.1.1.1
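
One possible way to handle the credentials item is to remove the userinfo part before matching rules and exclusions, and to re-attach it afterwards if desired. The sketch below only uses urllib.parse and is not part of this library; strip_credentials and the example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(url):
    """Return (url_without_userinfo, userinfo_or_None)."""
    parts = urlsplit(url)
    if "@" not in parts.netloc:
        return url, None
    # Everything before the last "@" in the authority is user[:password].
    userinfo, _, hostport = parts.netloc.rpartition("@")
    return urlunsplit(parts._replace(netloc=hostport)), userinfo


clean, userinfo = strip_credentials("http://user:secret@example.com/path?q=1")
```

Rules and exclusions would then be evaluated against clean, which starts with the plain host name that the rule patterns expect.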

https-everywhere-py's Issues

Split regex with probable subdomain prefixes into dotted parts

Subdomain analysis is currently done by pre-processing the URL regex to simplify the many regexes in the HTTPS Everywhere rules which are variations on matching any subdomain prefix (i.e. [^.]+\.), so that the regex can be passed to sre_yield without resulting in near-infinite expansions. c.f. google/sre_yield#14

The list of existing prefix regexes can be found at https://github.com/jayvdb/https-everywhere-py/blob/873a5b7/https_everywhere/_fixme.py#L158 , and other non-prefix regexes at https://github.com/jayvdb/https-everywhere-py/blob/873a5b7/https_everywhere/_fixme.py#L202

A partial solution would be to split the regex into subpatterns representing dotted parts, where simple logic can then determine that the first part is a near-infinite expansion. Some of the regexes would be computationally difficult to split on the literal \., as it is embedded in multiple complex branches, but as far as I can see these do not need to be split.

There is already rudimentary regex splitting at

def split_regex(pattern, at):
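
The dotted-part splitting proposed in this issue could look roughly like the sketch below. It is not the project's split_regex; split_on_literal_dot and is_open_ended are hypothetical helpers, and the character-class handling is deliberately simplified (a literal "]" placed first in a class is not handled).

```python
def split_on_literal_dot(pattern):
    """Split a regex source string on escaped dots outside groups and classes."""
    parts, current = [], []
    depth = 0          # nesting level of (...) groups
    in_class = False   # inside a [...] character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            escaped = pattern[i + 1]
            if escaped == "." and depth == 0 and not in_class:
                # A top-level escaped dot ends the current dotted part.
                parts.append("".join(current))
                current = []
            else:
                # Any other escape is kept verbatim, including escaped
                # dots that sit inside a group.
                current.append(ch + escaped)
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


def is_open_ended(part):
    """Heuristic: the part contains an unbounded repetition."""
    return "+" in part or "*" in part


prefix_parts = split_on_literal_dot(r"^http://[^.]+\.example\.com/")
```

Here the first part, ^http://[^.]+, is flagged by is_open_ended, which is the signal that it should not be handed to an exhaustive expander. Escaped dots inside groups are left alone, which matches the observation that deeply embedded dots do not need to be split.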

Invalid regex replacements in upstream rules

I'm pretty sure I already fixed this upstream, but the fix possibly hasn't been released yet.

Need to detect if it has been fixed or not, so that I can run the tests in both simplify and default mode.

Or add a simple hack/fallback for default mode so it doesn't break. That might be a better approach, as upstream is likely to let more bad data slip in.

self = <tests.test_rules.TestRules testMethod=test_package__<'CIBC'>>
name = 'CIBC'
    @foreach(_get_enabled_rulesets())
    def test_package(self, name):
        ruleset = rulesets[name]
>       self._check_ruleset(ruleset)
tests/test_rules.py:116: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_rules.py:109: in _check_ruleset
    self._check_https(rv)
tests/test_rules.py:89: in _check_https
    self.assertTrue(url.startswith("https://"))
E   AssertionError: False is not true
___________________ TestRules.test_package__<'Canada Post'> ____________________
self = <tests.test_rules.TestRules testMethod=test_package__<'Canada Post'>>
name = 'Canada Post'
    @foreach(_get_enabled_rulesets())
    def test_package(self, name):
        ruleset = rulesets[name]
>       self._check_ruleset(ruleset)
tests/test_rules.py:116: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_rules.py:109: in _check_ruleset
    self._check_https(rv)
tests/test_rules.py:89: in _check_https
    self.assertTrue(url.startswith("https://"))
E   AssertionError: False is not true
__________________ TestRules.test_package__<<'Pickaweb (...>> __________________
self = <tests.test_rules.TestRules testMethod=test_package__<<'Pickaweb (...>>>
name = 'Pickaweb (partial)'
    @foreach(_get_enabled_rulesets())
    def test_package(self, name):
        ruleset = rulesets[name]
>       self._check_ruleset(ruleset)
tests/test_rules.py:116: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_rules.py:103: in _check_ruleset
    rv = https_url_rewrite(test.url, rulesets=reduced_rulesets)
https_everywhere/_rules.py:856: in https_url_rewrite
    new_url = rule[0].sub(rule[1], url)
/usr/local/lib/python3.8/re.py:325: in _subx
    template = _compile_repl(template, pattern)
/usr/local/lib/python3.8/re.py:316: in _compile_repl
    return sre_parse.parse_template(repl, pattern)
/usr/local/lib/python3.8/sre_parse.py:1015: in parse_template
    addgroup(index, len(name) + 1)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
index = 2, pos = 2
    def addgroup(index, pos):
        if index > state.groups:
>           raise s.error("invalid group reference %d" % index, pos)
E           re.error: invalid group reference 2 at position 29
/usr/local/lib/python3.8/sre_parse.py:980: error
------------------------------ Captured log call -------------------------------
WARNING  https_everywhere._rules:_rules.py:858 failed during rule re.compile('^http://support\\.pickaweb\\.co\\.uk/(assets/)') -> https://app.sirportly.com/\g<2> , input http://pickaweb.co.uk/: invalid group reference 2 at position 29
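
The last failure above comes from a replacement that references group 2 while the pattern only defines one group. A fallback for default mode could check this before calling sub(). The following is a minimal sketch under that assumption; invalid_group_refs is a hypothetical helper, not something in _rules.py, and it ignores named groups and escaped backslashes.

```python
import re

# Matches \g<N> and \N style numeric references in a replacement template.
_GROUP_REF = re.compile(r"\\g<(\d+)>|\\(\d{1,2})")


def invalid_group_refs(pattern, template):
    """Return the numeric group references in template that pattern lacks."""
    available = re.compile(pattern).groups
    refs = [int(a or b) for a, b in _GROUP_REF.findall(template)]
    return sorted({ref for ref in refs if ref > available})


pattern = r"^http://support\.pickaweb\.co\.uk/(assets/)"
bad = invalid_group_refs(pattern, r"https://app.sirportly.com/\g<2>")
ok = invalid_group_refs(pattern, r"https://app.sirportly.com/\g<1>")
```

A rule for which this returns a non-empty list could be rejected at load time with a warning, instead of raising re.error during a rewrite.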

no-ip.com

Mentioned in _rules.py - needs re-analysis

Find minimum supported versions of dependencies

tox.ini should have a few combinations testing with older versions of dependencies.

The one exception is requests/urllib3, which should have a minimum version that includes the fixes for the latest security vulnerabilities - no exceptions for running older versions of those.

"cached-property" could be made optional - it is only used by the simplification process, and only because of lazy programming.

"appdirs" likely can have quite a low minimum version.

"logzero" could also be low, but it should also be re-evaluated whether it is a good choice for library logging. (#24)

Merge AppVeyor jobs into single job

AppVeyor images have all versions of Python pre-installed, so tox could run them all.

Tagging and uploading of coverage to codecov is a bit more difficult in that scenario.

Useful in its own right, but also sort of an alternative to #30

Parse HSTS Google entries

Currently only the includeSubdomain entries of the Chrome HSTS list are loaded. This most notably omits Google sites as they use a different entry structure.

Add type hints everywhere

Python 2.7 is supported, so the newer Python 3 syntax cannot be used, or if it is used, the installation / wheel processes need to strip the hints.

I've seen a few tools which appear to do the latter, and that would be preferred as the codebase should be "Python 3 first" with horrible hacks/degradation for Python 2.

Build usage example with minimal logging emitted

"logzero" is currently badly configured by default, and emits "info" messages by default, which isn't ideal for a library where most people will not care about the inner workings of the rule elimination/simplification process.

Try to control logzero, or switch to a different logging util.
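
If the switch goes to the standard library, the usual convention for libraries is to attach only a NullHandler and leave levels to the application. The sketch below shows that convention and is not the current behaviour of this package; the logger name "https_everywhere.example" and the function reject_rule are hypothetical, with the name modelled on the "https_everywhere._rules" prefix seen in captured test logs.

```python
import logging

# Hypothetical module logger, named after the package hierarchy.
logger = logging.getLogger("https_everywhere.example")

# A library attaches only a NullHandler and leaves the level unset,
# so the application decides what, if anything, gets printed.
logging.getLogger("https_everywhere").addHandler(logging.NullHandler())


def reject_rule(pattern):
    """Record a rejected rule at INFO level and report the rejection."""
    logger.info("Rejecting rule with pattern %r", pattern)
    return False
```

An application that does want the details can opt in with logging.basicConfig(level=logging.INFO), and one that wants silence can raise the level on the "https_everywhere" logger.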

dbs.com

"dbs.com" is one of the few pieces of literal domain-name-specific logic in _rules.py, and should somehow be made generic, or deferred by moving the literal into _fixme.py

Add Mozilla HSTS list

The Mozilla HSTS list is useful, if only for comparison, but it also includes negative entries which could add a lot of value.

Reduce duplication in test_adapter

It isn't clear how each adapter functions differently from the others.

TestCase inheritance was used for the everywhere adapters.

It should also be added to the non-everywhere adapters to reduce duplication of identical behaviour.

Replace subdomain wildcard targets with expanded hostnames from rules

Many rulesets have a very limited list of hostnames which match, yet use a *.foo.com target. With the list of possible hostnames extracted from the regex, the wildcard target can be replaced with those hostnames, and often the rule can then also be replaced with a simplified rule.
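
Assuming the hostnames have already been extracted from the rule regex (for example with sre_yield), the target replacement step could be sketched as follows. replace_wildcard_target is a hypothetical helper that uses fnmatch for the wildcard test; it is not the project's matching logic.

```python
from fnmatch import fnmatchcase


def replace_wildcard_target(targets, hostnames):
    """Replace each leading-wildcard target with the known hostnames it covers."""
    expanded = []
    for target in targets:
        if target.startswith("*."):
            matches = sorted(h for h in hostnames if fnmatchcase(h, target))
            # Keep the wildcard when no concrete hostname is known for it.
            expanded.extend(matches or [target])
        else:
            expanded.append(target)
    return expanded


new_targets = replace_wildcard_target(
    ["foo.com", "*.foo.com"],
    ["www.foo.com", "cdn.foo.com", "other.org"],
)
```

Once every target is a literal hostname, a rule that only rewrites the scheme can be stored as a plain lookup instead of a compiled regex.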

Add faster CI

I can't enable Travis CI for unknown reasons. Need to investigate.

Going to try Cirrus

regex including hosts which are not in targets

https://github.com/jayvdb/https-everywhere-py/blob/a3f2b42/https_everywhere/_rules.py#L406

There are lots of cases of regexes which refer to hosts which are not in the rule targets.

They are currently detected, but are not being rejected or considered in the tests. The known ones are stored in _FIXME_BROKEN_REGEX_MATCHES:

_FIXME_BROKEN_REGEX_MATCHES = [
    "affili.de",
    "www.belgium.indymedia.org",
    "m.aljazeera.com",
    "atms00.alicdn.com",
    "i06.c.aliimg.com",
    "allianz-fuercybersicherheit.de",
    "statics0.beauteprivee.fr",
    "support.bulletproofexec.com",
    "wwwimage0.cbsstatic.com",
    "cdn0.colocationamerica.com",
    "www.login.dtcc.edu",
    "ejunkie.com",
    "e-rewards.com",
    "member.eurexchange.com",
    "4exhale.org",
    "na0.www.gartner.com",
    "blog.girlscouts.org",
    "lh0.google.*",  # fixme
    "nardikt.org",
    ".instellaplatform.com",
    "m.w.kuruc.org",
    "search.microsoft.com",
    "static.millenniumseating.com",
    "watchdog.mycomputer.com",
    "a0.ec-images.myspacecdn.com",
    "a0.mzstatic.com",
    "my.netline.com",
    "img.e-nls.com",
    "x.discover.oceaniacruises.com",
    "www.data.phishtank.com",
    "p00.qhimg.com",
    "webassetsk.scea.com",
    "s00.sinaimg.cn",
    "mosr.sk",
    "sofurryfiles.com",
    "asset-g.soupcdn.com",
    "cdn00.sure-assist.com",
    "www.svenskaspel.se",
    "mail.telecom.sk",
    "s4.thejournal.ie",
    "my.wpi.edu",
    "stec-t*.xhcdn.com",  # fixme
    "www.*.yandex.st",  # fixme
    "s4.jrnl.ie",
    "b2.raptrcdn.com",
    "admin.neobookings.com",
    "webmail.vipserv.org",
    "ak0.polyvoreimg.com",
    "cdn.fora.tv",
    "cdn.vbseo.com",
    "edge.alluremedia.com",
    "secure.trustedreviews.com",
    "icmail.net",
    "www.myftp.utechsoft.com",
    "research-store.com",
    "app.sirportly.com",
    "ec7.images-amazon.com",
    "help.npo.nl",
    "css.palcdn.com",
    "legacy.pgi.com",
    "my.btwifi.co.uk",
    "orders.gigenetcloud.com",
    "owa.space2u.com",
    "payment-solutions.entropay.com",
    "static.vce.com",
    "itpol.dk",
    "orionmagazine.com",
    # fix merged, not distributed
    "citymail.com",
    "mvg-mobile.de",
    "inchinashop.com",
    "www.whispergifts",
    # already merged?
    "css.bzimages.com",
    "cdn0.spiegel.de",
]

These are mostly fixed in EFForg/https-everywhere#18949 and EFForg/https-everywhere#18957, but upstream has difficulty reviewing complex changesets. Splitting them might help, but progress on smaller PRs is also very slow, so these problems will linger for a while and need to be fixed in this library.

The regex hosts either need to be tested properly, so that the extra hosts can be added to the targets, used in the processing logic, and optimised sanely, or the regex should be simplified to remove these extra hosts.

100% coverage

There are already some # pragma: no cover in the code, and a few more will be enough to reach 100%, so that it can be enforced.

Add PyPy to CI

Once covered by CI, PyPy should be added to classifiers in setup.py

Updating of HSTS

The HSTS preload list is fetched once and not updated.

Need to add HTTP caching headers, etc.
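
One way to approach this is HTTP revalidation: store the validators from the last download and send them back on the next fetch. The sketch below only builds the request headers and interprets the status code; it assumes the server hosting the list returns ETag or Last-Modified headers, which has not been checked here. conditional_headers, should_reuse_cache and the cached dictionary layout are hypothetical.

```python
def conditional_headers(cached):
    """Build revalidation headers from a previously stored response."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def should_reuse_cache(status_code):
    """A 304 Not Modified response means the stored copy is current."""
    return status_code == 304


# Hypothetical metadata saved alongside the downloaded list.
cached = {"etag": '"abc123"', "last_modified": "Wed, 26 Feb 2020 09:40:55 GMT"}
headers = conditional_headers(cached)
```

These headers can be passed to any HTTP client; a 304 response then costs one small round trip instead of a full download.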

Add Windows to Cirrus CI

AppVeyor is slow to finish, so at least one Windows job should be run on Cirrus CI.

Then AppVeyor CI can be marked as optional, if that lets the commit go green quicker.

Python 2.7

Python 2.7 should still be achievable

Offline tests

Need to use responses or similar to allow the tests to pass even when offline, and even if the problematic websites used as test cases are 'fixed'.

Make sre_yield dependency optional

Currently the unregex module needs sre_yield master, but it doesn't need to: the code to skip over leading negative assertions is already in unregex.

Also, setup.py can have dependency links to force installation using zips, etc.

I don't want to force a release of sre_yield until it is ready, especially if there are enhancements in progress there which help this project's needs.

Store simplified rules for quicker restarts

Currently the simplification results are only in memory, and need to be re-done each time. They should be stored on disk for quicker subsequent start times.

Maybe a sqlite db, or emit a json file with the same structure as the published json.
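
The JSON variant could be sketched as below: the simplified rules are written next to a fingerprint of the source data, and recomputed only when the fingerprint changes. All names here (source_fingerprint, load_or_simplify, the cache layout, the stand-in simplify callable) are hypothetical, and the real simplification output may not be directly JSON-serialisable.

```python
import hashlib
import json
import os
import tempfile


def source_fingerprint(raw_rules):
    """Hash the upstream rules so a stale cache can be detected."""
    return hashlib.sha256(raw_rules.encode("utf-8")).hexdigest()


def load_or_simplify(raw_rules, cache_path, simplify):
    """Return cached simplified rules, recomputing only when the source changed."""
    fingerprint = source_fingerprint(raw_rules)
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cached = json.load(f)
        if cached.get("fingerprint") == fingerprint:
            return cached["rules"]
    rules = simplify(raw_rules)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump({"fingerprint": fingerprint, "rules": rules}, f)
    return rules


calls = []


def fake_simplify(raw_rules):
    """Stand-in for the expensive simplification step."""
    calls.append(raw_rules)
    return {"example.com": "force-https"}


cache_file = os.path.join(tempfile.mkdtemp(), "simplified.json")
first = load_or_simplify("raw rules v1", cache_file, fake_simplify)
second = load_or_simplify("raw rules v1", cache_file, fake_simplify)
```

The second call is served from disk, so fake_simplify runs only once. A sqlite database would follow the same pattern, with the fingerprint stored in a metadata table.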
