proofpoint-url-decoder

proofpoint-url-decoder is a collection of various Python scripts to unmangle Proofpoint "protected" emails.

Proofpoint Considered Harmful

Some reasons why Proofpoint is harmful (a non-exhaustive list):

  1. Proofpoint makes it harder to read email: because URLs are mangled, the user can no longer see what the actual URL is and must blindly trust a third party.

    • URLs are visibly mangled and filled with garbage when reading email on the command line using mutt, alpine, or emacs.
  2. Proofpoint erodes your privacy: in addition to someone else scanning your email, unique identifiers embedded in each mangled URL tie that URL to an individual user, and Proofpoint will independently visit (and possibly even crawl) the URL after the user has clicked on it.

Usage

Each program can be used standalone: pick and use the Python script that is most relevant to your use case.

There are several files of note:

  • decode.py: reads a URL as an input parameter, outputs a clean URL to STDOUT

    Example:

    $ set +H   # disable ! history substitution
    $ ./decode.py "https://urldefense.com/v3/__http://www.example.com__;!!foo!bar$"
    http://www.example.com
  • get_urls.py: reads as input an email (from STDIN), extracts and outputs clean URLs to STDOUT

  • decode_email.py: reads as input an email (from STDIN), and outputs the same email with clean URLs to STDOUT

    Example:

    $ cat email_message | ./decode_email.py > email_message.cleaned

decode_email.py

usage: decode_email.py [-h] [--plaintext] [--preserve-mbox-from]

decode proofpoint-mangled URLs in emails

options:
  -h, --help            show this help message and exit
  --plaintext, -p       decode URLs in plaintext input (not an email message)
  --preserve-mbox-from, -m
                        Preserve the mbox format email separator (From <addr> <timestamp>) on the first line

Integrating with Mail Delivery Agents

decode_email.py can be integrated with fdm and procmail to automatically filter messages and unmangle their URLs before the mail is delivered to your inbox.

fdm

Add the following rules to your .fdm.conf:

# An action to save to the maildir ~/Mail/inbox.
action "inbox" maildir "%h/Mail/inbox"
action "backup" maildir "%h/Mail/backup"

# Un-mangle ProofPoint URLs
action "unmangle" rewrite "/path/to/proofpoint-url-decoder/decode_email.py"

# (optional) keep a backup of all email
match all action "backup" continue

# 1. match all mail
# 2. run the "unmangle" action on each message (rewrite URLs)
# 3. run the "inbox" action on the resulting message (deliver to Maildir)
match all action "unmangle" continue
match all action "inbox"

Watch your log file (.fdm.log) for any issues. If you're processing a lot of mail at any one time, you may have to configure additional settings in .fdm.conf: see man 5 fdm.conf for more information.

procmail

Add the following rule near the beginning of your .procmailrc:

:0 fw
| /path/to/proofpoint-url-decoder/decode_email.py

You could match on and filter only emails containing the X-Proofpoint-* header (which would be all emails on systems that use Proofpoint), but sometimes you will get emails forwarded to you that lack this header and still contain mangled URLs.

It's a good idea to keep a backup copy of the emails, in case something in the processing pipeline goes wrong:

# copy all mail to the "backup" Maildir
:0c
backup/

# pipe message through decode_email.py
:0 fw
| /path/to/proofpoint-url-decoder/decode_email.py

# write resulting email into "inbox" Maildir
:0:
inbox/

You could also run decode_email.py on a copy of the email to test its functionality:

# create a working copy
:0c
{
    # pipe message through decode_email.py
    :0 fw
    | /path/to/proofpoint-url-decoder/decode_email.py

    # write resulting email into "testing" Maildir
    :0:
    testing/
}

Tests

There are some unit tests, with some library dependencies:

pip install -r requirements.txt
python3 -v decode_test.py

There are also some procmail tests: see procmail/.

Contributing

Feel free to contribute code or send comments, suggestions, or bug reports to [email protected].

Development Notes and Roadmap

For now, to keep each script independent, decode_ppv2() and decode_ppv3() are duplicated in each script.

LICENSE

CC0 1.0 Universal

Contributors

cardi, drdaved

proofpoint-url-decoder's Issues

decode outlook safelinks protection

Similar to #5, Outlook will rewrite URLs with *[.]safelinks[.]protection[.]outlook[.]com, so at some point you get an abomination of multiple entities rewriting each other's URLs.

We should find and unwrap these, to a certain limit.

(At some point, a URL could be rewritten so many times that it takes up more space/bandwidth/processing than the original email itself.)
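
A minimal sketch of unwrapping one Safe Links layer, assuming the wrapper keeps the original address percent-encoded in a `url` query parameter. The function name `unwrap_safelinks` and the `nam02` host prefix in the usage note are made up for illustration and are not part of this project.

```python
from urllib.parse import urlparse, parse_qs

# Any host under this suffix is treated as a Safe Links wrapper (assumption).
SAFELINKS_SUFFIX = ".safelinks.protection.outlook.com"


def unwrap_safelinks(url):
    """Return the wrapped target of a Safe Links URL, or `url` unchanged."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if not host.endswith(SAFELINKS_SUFFIX):
        return url
    # parse_qs percent-decodes values, so the target comes back readable.
    targets = parse_qs(parsed.query).get("url")
    return targets[0] if targets else url
```

Calling it on `https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.example.com%2F&data=x` should give back `https://www.example.com/`. Anything that is not a Safe Links URL is returned untouched, so the function can be chained with other unwrappers.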

Cisco secure web?

Damn. My employer has changed the "secure" links provider, and now it's Cisco. The URLs have the following shape:

https://secure-web.cisco.com/1W9jhe2SGm2BNitIIaautca8rNFg8x1HzdiXH2nqdTHek8f3H2xv8js8dm9EVu3HRSeIAkMj6c2zwWFmrcG8XKsupK8sSz5j8Zog1At25XnpzkZ6gPXk6y_O4oqFgmV_OesoEEurqTsYFv_GeckTqxJ5ThIWtTBbiLD1r4AX8PGJuDI7rRGT22a-W8kVsXnYUr1LvMrOQnSufLQ5EJ3Fb95jONCil7uSQ_e0YNOA0ErMVvlvOQis-bWdOSNxEXZU1st6Ud_NKGOudW7_GI7IK_FYfJl3j-gkbzf25eF2X1KI/https%3A%2F%2Fmailchi.mp%2Fenqa%2Fenqa-bulletin-jun2024%3Fe%3D65775e6286

Looks to me that they are easier to spot than Proofpoint's, but my absolute lack of regex knowledge won't allow me to filter them adequately. Is there any chance of adding this type of URL to the url-decoder?
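
Going only by the single example above, the path seems to be an opaque token followed by the percent-encoded target, so no regex may be needed. A rough sketch under that assumption (`unwrap_cisco` is a hypothetical name, and other Cisco URL shapes may exist that this does not handle):

```python
from urllib.parse import urlparse, unquote


def unwrap_cisco(url):
    """Return the target of a secure-web.cisco.com URL, or `url` unchanged."""
    parsed = urlparse(url)
    if parsed.hostname != "secure-web.cisco.com":
        return url
    # Assumed path layout: "/<token>/<percent-encoded target URL>".
    parts = parsed.path.split("/", 2)
    if len(parts) < 3 or not parts[2]:
        return url
    return unquote(parts[2])
```

With the URL from this issue, the last path segment decodes to `https://mailchi.mp/enqa/enqa-bulletin-jun2024?e=65775e6286`.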

URLs mangled multiple times

Sometimes the URLs are mangled multiple times, e.g.,

https://urldefense[.]us/v2/url?u=https-3A__urldefense[.]com_v3_

It used to be the case that if a URL already had the form of http://urldefense[.]com/..., the middlebox wouldn't touch it (though would it change any of the user/org-based parameters?).

This behavior could be the result of migrating from (or incorporating) their multiple domains or the mixing of v3 and v2 versions.

The solution for now should be to recursively decode until the URL is clean, up to some limit (maybe 5?).
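
The bounded loop could look roughly like this. `decode_fully` and `peel_v2` are hypothetical names: `peel_v2` is a simplified stand-in for a single-layer decoder (it assumes v2 writes `%` as `-` and `/` as `_` inside the `u` parameter), not the project's `decode_ppv2()`.

```python
from urllib.parse import urlparse, parse_qs, unquote


def peel_v2(url):
    """Remove one v2 layer; return `url` unchanged if it is not a v2 URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if not host.startswith("urldefense.") or parsed.path != "/v2/url":
        return url
    encoded = parse_qs(parsed.query).get("u")
    if not encoded:
        return url
    # Assumed v2 scheme: "-" stands for "%" and "_" stands for "/".
    return unquote(encoded[0].replace("-", "%").replace("_", "/"))


def decode_fully(url, decode_once, max_depth=5):
    """Apply decode_once until the URL stops changing or max_depth is hit."""
    for _ in range(max_depth):
        decoded = decode_once(url)
        if decoded == url:
            break
        url = decoded
    return url
```

Stopping when the output equals the input means clean URLs cost a single call, and the `max_depth` cap keeps a pathological or self-referencing URL from looping forever. A combined decoder that tries v2, then v3, then other wrappers could be passed in as `decode_once`.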

Extra space in urldefense v3

When decoding emails with URLs mangled with urldefense v3, there is a little glitch: urldefense seems to always add a space after the URL, and decode_email.py keeps this space. For example, if someone sends an email with <https://example.com>, the decoded email will have <https://example.com >. This is not a big deal at all, but perhaps the solution is easy enough.

Thank you very much for this very useful piece of software!
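
One possible approach, sketched without knowing how decode_email.py builds its own pattern: let the regex also consume a single space when it sits between the closing `$` and a `>`. The names `V3_IN_TEXT` and `replace_v3` are made up, and `decode` is whatever single-URL decoder is in use.

```python
import re

# Group 1 is the v3 URL up to its terminating "$". The optional second group
# swallows one space, but only when that space is directly followed by ">".
V3_IN_TEXT = re.compile(r"(https://urldefense\.com/v3/__\S+?\$)( (?=>))?")


def replace_v3(text, decode):
    """Replace each v3 URL in `text` with decode(url), dropping the stray space."""
    return V3_IN_TEXT.sub(lambda m: decode(m.group(1)), text)
```

The lookahead keeps ordinary spaces intact, so a mangled URL in the middle of a sentence is still followed by its space after decoding.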

non-ASCII characters in the URL breaks the decoder

I love this tool and it has become super handy. There's an email
it cannot process, however. I've traced the problem down to this
url:

<https://urldefense.com/v3/__https://es.linkedin.com/in/pastora-mart**Anez-samper-3049818__;w60!!D9dNQwwGXtA!VCWZpmS3jIZRe5xmjoL0_MX9ZUWwcPFyz6CnsBeDuRVdXGZbS1iTRYZ19iLt7VCJmmzf528Jpli6tlesUnQ$ >

Proofpoint-url-decoder gives me:

Traceback (most recent call last):
  File "/home/me/src/proofpoint-url-decoder/decode_email.py", line 388, in <module>
    e_clean = process_text(e)
              ^^^^^^^^^^^^^^^
  File "/home/me/src/proofpoint-url-decoder/decode_email.py", line 354, in process_text
    e_clean = re.sub(
              ^^^^^^^
  File "/usr/lib/python3.11/re/__init__.py", line 185, in sub
    return _compile(pattern, flags).sub(repl, string, count)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/src/proofpoint-url-decoder/decode_email.py", line 357, in <lambda>
    decode(match.group()) if "urldefense" in match.group() else match.group()
    ^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/src/proofpoint-url-decoder/decode_email.py", line 295, in decode
    cleaned_url = decode_ppv3(mangled_url, unquote_url)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/src/proofpoint-url-decoder/decode_email.py", line 243, in decode_ppv3
    replacement_chars.append(replacement_list.pop(0))
                             ^^^^^^^^^^^^^^^^^^^^^^^
IndexError: pop from empty list

What do you think? It is indeed a weird url, since my bash
terminal does not recognise it if I paste it. The "**A" bit of the
url corresponds to the accented char "í", so maybe that's the
problem...
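
A quick check of the `w60` field from that URL suggests a byte-versus-character mismatch. The helper name `replacement_bytes` is made up for this sketch, and reading `**A` as "a run of two" is an inference from this one example, not something confirmed by the code.

```python
from base64 import urlsafe_b64decode


def replacement_bytes(encoded):
    """Decode the base64 field after '__;' in a v3 URL, restoring padding."""
    return urlsafe_b64decode(encoded + "=" * (-len(encoded) % 4))


raw = replacement_bytes("w60")
print(raw)                        # b'\xc3\xad' (the UTF-8 encoding of "í")
print(len(raw))                   # 2 bytes
print(len(raw.decode("utf-8")))   # 1 character
```

If `**A` asks for two replacement units and the replacement list is built from the decoded string, the list holds only one item, which would explain the `pop from empty list`. Substituting on raw bytes and decoding to text afterwards may avoid the problem.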

decode parameters

A Proofpoint URL takes the format of the following:
urldefense.proofpoint.com/v2/url?[params]

where [params] consists of the following:

c := constant (per organization)
d := constant (per organization)
e := always empty?
m := ?
r := unique identifier tied to email address
s := ?
u := safe-encoded URL

m might be a hash of the original URL or some metadata
s might be a signature or checksum

Disturbingly, each URL contains a unique identifier that's tied to the individual's email address, letting the organization (that hosts the email) and Proofpoint log when a particular user clicks on a URL and quite possibly crawl the target content.

It would be nice to understand what each parameter does and how it's encoded (or, if possible, how to decode it).
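
As a starting point for poking at the parameters, the query string can be split with the standard library. This is only a sketch: `parse_v2_params` is a hypothetical helper, the parameter values in the usage note are dummies, and the `u` decoding assumes `-` stands for `%` and `_` for `/`.

```python
from urllib.parse import urlparse, parse_qs, unquote


def parse_v2_params(url):
    """Return the query parameters of a v2 URL as a dict, with 'u' decoded."""
    # keep_blank_values retains parameters such as "e" that appear empty.
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    params = {key: values[0] for key, values in query.items()}
    if "u" in params:
        params["u"] = unquote(params["u"].replace("-", "%").replace("_", "/"))
    return params
```

For `https://urldefense.proofpoint.com/v2/url?u=http-3A__www.example.com_&d=D&c=C&r=R&m=M&s=S&e=`, this returns `u` as `http://www.example.com/` and leaves the other values as opaque strings. Collecting these dicts across many emails would make it easier to see which values stay constant per organization or per recipient.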

urldefense v3: new encoding

Example URL: https://urldefense[.]com/v3/__https://contact.framasoft.org/*newsletter__;Iw!!LIr3w8kk_Xxm!6BNqFLJ13q7N5_lf3XQFlmTtgY5CkKjhfcIn4ybAhA1_gx_y07jmQ4uvR2QZ$
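
From this example, the layout appears to be `v3/__<url>__;<base64>!!<org>!<signature>$`, where each `*` in `<url>` marks a character that was replaced and `<base64>` holds the replacements (`Iw` decodes to `#`). A rough sketch under those assumptions follows. It is not the project's `decode_ppv3()`, and the `**X` run-length table (A = 2, B = 3, and so on, counted in bytes) is inferred from the `**A`/`w60` pair reported in the non-ASCII issue above.

```python
import re
import string
from base64 import urlsafe_b64decode

V3_PATTERN = re.compile(r"v3/__(?P<url>.+?)__;(?P<enc>.*?)!")
# "*" marks one replaced byte; "**X" marks a run whose length is encoded by X.
TOKEN = re.compile(rb"\*(\*[A-Za-z0-9_-])?")
RUN_CHARS = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"
# Assumed mapping: "**A" is a run of 2, "**B" a run of 3, and so on.
RUN_LENGTH = {ord(char): n for n, char in enumerate(RUN_CHARS, start=2)}


def decode_v3(mangled):
    """Decode one v3 layer; return `mangled` unchanged if it does not match."""
    match = V3_PATTERN.search(mangled)
    if match is None:
        return mangled
    enc = match.group("enc")
    replacements = urlsafe_b64decode(enc + "=" * (-len(enc) % 4))
    pos = 0

    def substitute(token):
        nonlocal pos
        text = token.group()
        length = RUN_LENGTH[text[2]] if len(text) == 3 else 1
        chunk = replacements[pos:pos + length]
        pos += length
        return chunk

    # Substitute on bytes so that multi-byte characters are counted correctly.
    decoded = TOKEN.sub(substitute, match.group("url").encode("utf-8"))
    return decoded.decode("utf-8")
```

On the example URL (with the host written normally) this should give `https://contact.framasoft.org/#newsletter`. Percent-decoding of the result is left out on purpose.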
