
url-cleaner's Introduction

URL Cleaner

Websites often put unique identifiers into URLs so that, when you send a link to a friend and they open it, the website knows it was you who sent it to them.
As most people do not understand and therefore cannot consent to this, it is polite to remove the spyware query parameters before sending URLs to people.
URL Cleaner is an extremely versatile tool designed to make this process as comprehensive, fast, and easy as possible.

C dependencies

These packages are required on Kubuntu 24.04 (and therefore probably on all Debian-based distros):

  • libssl-dev
  • libsqlite3-dev

There are likely plenty more dependencies required that various Linux distros may or may not pre-install.

If you can't compile it I'll try to help you out. And if you can make it work on your own, please let me know so I can add it to this list.

Anonymity

In theory, if you're the only one on a website sharing posts from that website without URL trackers, the website could realize that and track you in much the same way ("if no trackers, then assume the sender is Jared").
In practice you are very unlikely to be the only one sharing clean URLs. Search engines generally provide URLs without trackers[citation needed], some people manually remove trackers, and some websites like vxtwitter.com automatically strip URL trackers.

However, for some websites the default config strips more stuff than search engines. In this case anonymity does fall back to many people using URL Cleaner and providing cover for each other.

As with Tor, protests, and anything else where privacy matters, safety comes in numbers.

Canonicalization

There is a vague notion of "canonical" URLs: a form that every URL with the same semantics can be converted into without needing any external information.

For example, https://amazon.com/product-name-here/dp/PRODUCTID can be "canonicalized" to https://amazon.com/dp/PRODUCTID.

I have not yet decided if I want to remove the minimize flag and/or just make it the default behavior.

Input on the matter would be appreciated.

Basic usage

By default, compiling URL Cleaner includes the default-config.json file in the binary. Because of this, URL Cleaner can be used simply with url-cleaner "https://example.com/of?a=dirty#url".
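For example, a hypothetical invocation (the URL and its parameters are made up; utm_ parameters are universal tracking parameters per the Sets and Lists sections below, so with the default config something like this should happen):

  url-cleaner "https://example.com/page?utm_source=newsletter&id=123"
  # should print: https://example.com/page?id=123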

The default config

The default config is intended to always obey the following rules:

  • "Meaningful semantic changes"[definition?] should only ever occur as a result of a flag being enabled.
    • Insignificant details, like the navbar on Amazon listings (which is full of links to item categories) differing slightly between URLs, are, as previously stated, insignificant.
  • URLs that are "semantically valid"[definition?] shouldn't ever throw an error.
    • URLs that aren't semantically valid also shouldn't ever throw an error but that is generally less important.
    • URLs that are semantically valid should never change semantics and/or become semantically invalid.
      • URLs that are semantically invalid may become semantically valid if there is an obvious way to do so (re: unmangle flag).
  • Outside of long (>10) or infinite chains of redirects/shortlinks, cleaning should always be idempotent.
  • The command and debug features, as well as any features starting with experiment-/experimental-, are never expected to be enabled. The command feature is enabled by default for convenience but, for situations where untrusted/user-provided configs have a chance of being run, it should be disabled.
  • All caching is expected to be deterministic.
    • The rest of the default config is also expected to be deterministic, but when nothing is being cached, occasional corrections don't cause problems.
    • Usually redirect links don't randomly change (looking at you, goo.gl).
    • The onion-location flag does throw a minor wrench into this but whatever.

Currently no guarantees are made, though when the above rules are broken it is considered a bug and I'd appreciate being told about it.

Additionally, these rules may be changed at any time for any reason. Usually just for clarification.

Flags

Flags let you specify behaviour with the --flag name --flag name2 command line syntax.

Various flags are included in the default config for things I want to do frequently.

  • no-https-upgrade: Disable replacing http:// with https://.
  • no-http: Don't make any HTTP requests.
  • assume-1-dot-2-is-shortlink: Treat all hosts that match the regex ^.\...$ (e.g. t.co) as shortlinks. Let's be real, they all are.
  • no-unmangle: Disable all unmangling.
  • no-unmangle-host-is-http-or-https: Don't convert https://https//example.com/abc to https://example.com/abc.
  • no-unmangle-path-is-url: Don't convert https://example1.com/https://example2.com/user to https://example2.com/user.
  • no-unmangle-second-path-segment-is-url: Don't convert https://example1.com/profile/https://example2.com/profile/user to https://example2.com/profile/user.
  • no-unmangle-subdomain-ends-in-not-subdomain: Don't convert https://profile.example.com.example.com to https://profile.example.com.
  • no-unmangle-subdomain-starting-with-www-segment: Don't convert https://www.username.example.com to https://username.example.com.
  • unmobile: Convert https://m.example.com, https://mobile.example.com, https://abc.m.example.com, and https://abc.mobile.example.com into https://example.com and https://abc.example.com.
  • minimize: Remove non-essential parts of the URL that are likely not tracking related.
  • youtube-unshort: Turns https://youtube.com/shorts/abc into https://youtube.com/watch?v=abc.
  • discord-unexternal: Replace images-ext-1.discordapp.net with the original images they refer to.
  • discord-compatibility: Sets the domain of twitter.com to the domain specified by the twitter-embed-domain variable.
  • deadname-twitter: Change x.com to twitter.com.
  • breezewiki: Sets the domain of fandom.com and BreezeWiki to the domain specified by the breezewiki-domain variable.
  • unbreezewiki: Turn BreezeWiki into fandom.com.
  • tor2web: Append the suffix specified by the tor2web-suffix variable to .onion domains.
  • tor2web2tor: Replace **.onion.** domains with **.onion domains.
  • bypass.vip: Use bypass.vip to expand linkvertise links. Currently untestable as the API is down.

If a flag is enabled in a config's "params" field, it can be disabled using --unflag flag1 --unflag flag2.
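For example, a hypothetical invocation using the unmobile flag described above (the URL is made up):

  url-cleaner --flag unmobile "https://m.example.com/page"
  # should print: https://example.com/page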

Variables

Variables let you specify behaviour with the --var name=value --var name2=value2 command line syntax.

Various variables are included in the default config for things that have multiple useful values.

  • twitter-embed-domain: The domain to use for twitter when the discord-compatibility flag is specified. Defaults to vxtwitter.com.
  • breezewiki-domain: The domain to use to turn fandom.com and BreezeWiki into BreezeWiki. Defaults to breezewiki.com.
  • tor2web-suffix: The suffix to append to the end of .onion domains if the flag tor2web is enabled. Should not start with . as that's added automatically. Left unset by default.
  • bypass.vip-api-key: The API key used for bypass.vip's premium backend. Overrides the URL_CLEANER_BYPASS.VIP_API_KEY environment variable.

If a variable is specified in a config's "params" field, it can be unspecified using --unvar var1 --unvar var2.
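For example, a hypothetical invocation combining the discord-compatibility flag with an overridden twitter-embed-domain (fxtwitter.com is just an illustrative value; the exact output depends on the default config):

  url-cleaner --flag discord-compatibility --var twitter-embed-domain=fxtwitter.com "https://twitter.com/user/status/123"
  # should print: https://fxtwitter.com/user/status/123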

Environment variables

There are some things you don't want in the config, like API keys, that are also a pain to repeatedly insert via --var abc=xyz. For this, URL Cleaner makes use of environment variables.

  • URL_CLEANER_BYPASS.VIP_API_KEY: The API key used for bypass.vip's premium backend. Can be overridden with the bypass.vip-api-key variable.
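Note that because the variable name contains a dot, most shells won't accept a plain NAME=value assignment; a wrapper like env can set it instead (a hypothetical example with a placeholder key and URL):

  env 'URL_CLEANER_BYPASS.VIP_API_KEY=your-key-here' url-cleaner --flag bypass.vip "https://example.com/some-link"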

Sets

Sets let you check if a string is one of many specific strings very quickly.

Various sets are included in the default config.

  • https-upgrade-host-blacklist: Hosts to not upgrade from http to https even when the no-https-upgrade flag isn't enabled.
  • shortlink-hosts: Hosts that are considered shortlinks in the sense that they return HTTP 3xx status codes. URLs with hosts in this set (as well as URLs with hosts that are "www." then a host in this set) will have the ExpandShortLink mapper applied.
  • utps-host-whitelist: Hosts to never remove universal tracking parameters from.
  • utps: The set of "universal tracking parameters" that are always removed for any URL with a host not in the utps-host-whitelist set. Please note that the UTP rule in the default config also removes any parameter starting with cm_mmc, __s, at_custom, and utm_, and thus parameters starting with those can be omitted from this set.
  • unmangle-path-is-url-host-whitelist: Effectively the no-unmangle-path-is-url flag for the specified hosts.
  • unmangle-subdomain-ends-in-not-subdomain-not-subdomain-whitelist: Effectively the no-unmangle-subdomain-ends-in-not-subdomain flag for the specified not-subdomains.
  • breezewiki-hosts: Hosts to replace with the breezewiki-domain variable when the breezewiki flag is enabled. fandom.com is always replaced and is therefore not in this set.
  • lmgtfy-hosts: Hosts to replace with google.com.

Sets can have elements inserted into them using --insert-into-set name1 value1 value2 --insert-into-set name2 value3 value4.

Sets can have elements removed from them using --remove-from-set name1 value1 value2 --remove-from-set name2 value3 value4.
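For example, a hypothetical invocation that adds example.com to the utps-host-whitelist set so its utm_ parameters are left alone:

  url-cleaner --insert-into-set utps-host-whitelist example.com "https://example.com/?utm_source=test"
  # should print: https://example.com/?utm_source=test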

Lists

Lists allow you to iterate over strings for things like checking if another string contains any of them.

Currently only one list is included in the default config:

  • utp-prefixes: If a query parameter starts with any of the strings in this list (such as utm_) it is removed.

Currently there is no command line syntax for them. There really should be.

Citations

The people and projects I have stolen various parts of the default config from.

And by that I basically just mean the UTP set. And the UTP prefix list.

Custom rules

Although proper documentation of the config schema is pending me being bothered to do it, the url_cleaner crate itself is reasonably well documented and the various types are (I think) fairly easy to understand.
The main files you want to look at are conditions.rs and mappers.rs.
Additionally url_part.rs, string_location.rs, and string_modification.rs are very important for more advanced rules.

Footguns

There are various things in/about URL Cleaner that I or many consider stupid but for various reasons cannot/should not be "fixed". These include but are not limited to:

  • For UrlParts and Mappers that use "suffix" semantics (the idea that the ".co.uk" in "google.co.uk" is semantically the same as the ".com" in "google.com"), the psl crate is used, which in turn uses Mozilla's Public Suffix List. Various suffixes are included that one may expect to be normal domains, such as blogspot.com. If for some reason a domain isn't working as expected, that may be the issue. (See the sketch after this list.)

  • Regex support uses the regex crate, which doesn't support look-around or backreferences. Certain common regex operations can't be expressed without them, but this should never come up in practice. The regex crate's documentation includes a syntax reference for people who don't know Rust's regex syntax.
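For the suffix footgun above, here is a minimal sketch of the underlying behaviour, assuming the psl crate's suffix_str/domain_str helpers (this illustrates the Public Suffix List, not URL Cleaner's own API):

fn main() {
    // "blogspot.com" is on the Public Suffix List, so it is treated as a suffix
    // rather than as a normal registrable domain.
    assert_eq!(psl::suffix_str("myblog.blogspot.com"), Some("blogspot.com"));
    // The registrable domain of "myblog.blogspot.com" is therefore the whole host,
    // unlike "www.google.com", whose registrable domain is "google.com".
    assert_eq!(psl::domain_str("myblog.blogspot.com"), Some("myblog.blogspot.com"));
    assert_eq!(psl::domain_str("www.google.com"), Some("google.com"));
}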

MSRV

The Minimum Supported Rust Version is the latest stable release. URL Cleaner may or may not work on older versions, but there's no guarantee.

If this is an issue I'll do what I can to lower it but Diesel also keeps a fairly recent MSRV so you may lose caching.

Untrusted input

Although URL Cleaner has various feature flags that can be disabled to make handling untrusted input safer, no guarantees are made. Especially if the config file being used is untrusted.
That said, if you notice any rules that use but don't actually need HTTP requests or other data-leaky features, please let me know.

CLI

Parsing output

Note: JSON output is supported.

Unless Mapper::(e|)Print(ln|) or a Debug variant is used, the following should always be true (see the illustration after this list):

  1. Input URLs are a list of URLs starting with URLs provided as command line arguments then each line of the STDIN.

  2. The nth line of STDOUT corresponds to the nth input URL.

  3. If the nth line of STDOUT is empty, then something about reading/parsing/cleaning the URL failed.

  4. The nth non-empty line of STDERR corresponds to the nth empty line of STDOUT.

    1. Currently empty STDERR lines are not printed when a URL succeeds. While it would make parsing the output easier it would cause visual clutter on terminals. While this will likely never change by default, parsers should be sure to follow 4 strictly in case this is added as an option.
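A hypothetical illustration of points 2-4, assuming the second input fails to parse (the URLs and the exact error text are made up):

  url-cleaner "https://example.com/?utm_source=test" "not a url"

  STDOUT:
  https://example.com/
  (an empty line, because the second input failed)

  STDERR:
  (one non-empty line describing why "not a url" failed)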

JSON output

The --json/-j flag can be used to have URL Cleaner output JSON instead of lines.

The exact format is currently in flux.

If a Mapper::Print(ln|) is used, this is not guaranteed to be valid JSON.

Panic policy

URL Cleaner should only ever panic under the following circumstances:

  • Parsing the CLI arguments failed.

  • Loading/parsing the config failed.

  • Printing the config failed. (Shouldn't be possible.)

  • Testing the config failed.

  • Reading from/writing to STDIN/STDOUT/STDERR has a catastrophic error.

  • Running out of memory resulting in a standard library function/method panicking. This should be extremely rare.

  • (Only possible when the debug feature is enabled) The mutex controlling debug printing indenting is poisoned and a lock is attempted. This should only be possible when URL Cleaner is used as a library.

Outside of these cases, URL Cleaner should never panic. However as this is equivalent to saying "URL Cleaner has no bugs", no actual guarantees can be made.

Funding

URL Cleaner does not accept donations. If you feel the need to donate please instead donate to The Tor Project and/or The Internet Archive.


url-cleaner's Issues

`https://www.paypal.com/webapps/mch/cmd/?v=3.0&t=$UNIX&fdata=$STUFF`

  • fdata seems to be a list of base64 streams ("segments") delimited by ".".
  • The example I have (from a "you sent a payment" receipt email) redirects to https://www.paypal.com/cgp/mgm/referrer?intent=mktg&tsrce=txn.
  • It seems the first segment is about 4/3x the length of the final URL but base64 decoding with start-truncating 0-3 characters doesn't seem to work.
  • The fifth segment is also about 4/3x the length of the final URL but again I've yet to find any way to decode it.
  • The second segment is very large and may contain the answer?

Condition mapper state

For https://a.com/https://a.com unmangling and the unmobile flag in the next commit, it'd be very handy to have the condition tell the mapper in which path/domain segment a certain thing was found

I guess I could let a rule contain a hashmap that gets reset any time the rule is run and conditions and mappers that read/write to that hashmap?

{"state": {"a": 0, "b": "xyz"}, ...}?

Per-URL params

It'll be useful for url-cleaner-site to send details of <a> elements like attributes and innerText

For example it seems it can be used to completely bypass ExpandShortlinking t.co links on twitter

DuckDuckGo bang handler

In the incredibly optimistic future where a browser uses URL Cleaner, this'll be very handy

A generic `StringSource` API

enum Mapper {
    /// Sets the specified part of the URL to the value produced by a StringSource.
    SetPart {
        part: UrlPart,
        value: StringSource
    }
}

/// A generic way for mappers/conditions to obtain a string.
enum StringSource {
    /// A literal string.
    String(String),
    /// The value of a part of the URL being cleaned.
    Part(UrlPart),
    /// The value of a variable from the params.
    Variable(String)
}

impl StringSource {
    /// Returns the referenced string, if it exists.
    fn get_string<'a>(&'a self, url: &Url, params: &Params) -> Option<&'a str> {
        todo!()
    }
}

This allows removing Mapper::CopyPart and Mapper::SetPartToVar, as well as making Condition::PartIs more versatile.

Pending either serde or serde_as allowing FromStr fallback on a per-type basis instead of just a per-field basis.

Status codes

  • How to handle the status code when multiple URLs are given?
    • 0..=16 => bitflags?
    • 16.. => ???
  • Require only 1 URL for status code?
  • Return 1 if any error?
