
whotracks.me's People

Contributors

birdsarah, chrmod, dbalan, dependabot[bot], ecnmst, humera-cliqz, ilaria-cliqz, karlolukic, konarkmodi, mdsandu, orenyomtov, philipp-classen, remusao, sammacbeth, smalluban, valmikkpatel, y3ti


whotracks.me's Issues

Downloading data / cloning a repository

Hi guys,

I'm receiving the following error message after cloning this repo: "This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access."

I'm attaching the log file here: 20201102T112956.908151.log

The above error persists even after running: git lfs fetch

Basically, none of the files in the whotracksme/data/assets/... folders are converted from their LFS pointers to real .csv files.

How can one access the raw .csv files that were once available in the assets/data/... folders?

Changes in data collection since 2019?

Hi,
I started to look at the trackers data and made some observations that made me wonder whether there have been changes since 2019 in how this data is collected.

  1. There seems to be missing data for 'has_blocking' in the trackers table from Jan 2019 to May 2019. Was there a change in instrumentation or user setup during this time? Is it OK to use the remaining data from this period?
  2. I noticed that the use of cookies (averaged over all trackers per month) is steadily decreasing from 2019 to 2021, which isn't intuitive to me (see the sketch after this list). Has this been observed/understood? If there is any literature on this, could you please point me to it?
  3. Is it possible to obtain the number of users or page loads from each country each month?

Thanks!
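
A minimal sketch of the computation in point 2, assuming the whotracksme package and that the trackers table exposes 'month' and 'cookies' columns (the column name may differ between releases):

from whotracksme.data.loader import DataSource

# average cookie usage across all trackers, per month
data = DataSource(region='global')
df = data.trackers.df
monthly_cookie_use = df.groupby('month')['cookies'].mean()
print(monthly_cookie_use)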

"most tracked websites" tables are identical

On the home page, in the section "The most tracked websites", there is the option to display by "traffic" or "Average number of trackers". These two currently display exactly the same content (which is average number of trackers).

We should show something different under "traffic", or not show it at all.

Data does not seem to be shipped with whotracksme from pypi

There seems to be an issue with whotracksme when installed from pypi: the data is not packaged with the code. It would also be nice to automatically publish a new version on pypi every time the data is updated; this should be possible from travis. A packaging sketch is below.
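
A minimal sketch of one way to ship the data with the package, assuming setuptools and the repository's assets layout (the glob patterns are illustrative, not the project's actual setup.py):

from setuptools import setup, find_packages

setup(
    name='whotracksme',
    packages=find_packages(),
    include_package_data=True,
    package_data={
        # ship the monthly per-region CSVs (e.g. 2017-05/global/trackers.csv)
        # and the tracker database with the package
        'whotracksme.data': ['assets/*/*/*.csv', 'assets/trackerdb.sql'],
    },
)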

Need clarity on licensing terms

Hi,

I have a query related to the licensing terms mentioned in this project. As licensing description is mentioned as below:
"The content of this project itself is licensed under the Creative Commons Attribution 4.0 license, and the underlying source code used to generate and display that content is licensed under the MIT license."

Q1. Does it mean that all the data, including third party/tracker information present at https://github.com/cliqz-oss/whotracks.me/tree/master/whotracksme/data/assets, are covered under Creative Common Attribution 4.0 license?

Q2. Can I also use the data present here for some experiments in my personal project?

Regards,
Ethan

Adguard clarification

Hi!

I've just found out that Adguard is listed as a tracker on whotracksme: https://whotracks.me/trackers/adguard.html

This is not quite true, but I can see where it comes from. Let me please clarify the situation.

  1. AdGuard for Windows/Mac is a network-level content blocker, so it cannot simply add custom JS/CSS to webpages the way browser extensions do.
  2. In order to do that, it injects a content script: <script src="https://local.adguard.com/blahblah/content-script.js"> that takes care of cosmetic rules.
  3. Connections to local.adguard.com are intercepted by the network driver and processed locally. Also, we changed the domain to local.adguard.org in the newer versions.
  4. This is a usual approach for network-level software. For instance, you have Kaspersky listed as a tracker because of the very same thing -- they add a content script to every page.

What's important here:

  1. There are no remote connections; everything is processed locally.
  2. There is no tracking, fingerprinting, or anything of the sort.

Database update

Hi!
I installed the project from pip, and after

data = DataSource()

shows:

data available for months: ['2017-05', '2017-06', '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12', '2018-01', '2018-02', '2018-03', '2018-04']
Is there a way to download the latest data, or do I have to manually download the database and then load it?

Library version conflicts when installing from source

I tried to install whotracks.me from source, but there were library version conflicts.
How to reproduce:

 $  git clone https://github.com/ghostery/whotracks.me.git
 $  cd whotracks.me
 $  conda create -n whotracksme python=3.8
 $  conda activate whotracksme
 $  pip install -r requirements.txt
...
ERROR: Cannot install -r requirements.txt (line 9) and urllib3==1.26.5 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested urllib3==1.26.5
    requests 2.24.0 depends on urllib3!=1.25.0, !=1.25.1, <1.26 and >=1.21.1

My system: Ubuntu 18.04.

A quick fix: update the version of requests in requirements.txt:

--- a/requirements.txt
+++ b/requirements.txt
@@ -6,6 +6,6 @@ numpy==1.19.1
 pandas==1.1.2
 python-dateutil==2.8.1
 pytz==2020.1
-requests==2.24.0
+requests==2.25.1
 six==1.15.0
 urllib3==1.26.5

Conserve Git LFS bandwidth

Currently, we exceed our Git LFS limits relatively quickly. Some ideas to reduce the amount of downloaded data:

  • By default, download only the most recent data (configure lfs.fetchinclude / lfs.fetchexclude, see git-lfs/git-lfs#2717). New data per month takes around 250M, while the whole data set is currently around 6.7G (and will grow with each month).
  • Compress the csv files (the expected compression ratio with xz is 20%-25%; a quick check is sketched below).
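
A minimal sketch for checking the expected xz ratio, assuming a local checkout with the CSVs under whotracksme/data/assets:

import lzma
from pathlib import Path

# report the xz (LZMA) compressed size of a few asset CSVs as a
# fraction of their raw size
def xz_ratio(path: Path) -> float:
    raw = path.read_bytes()
    return len(lzma.compress(raw, preset=9)) / len(raw)

for csv_file in sorted(Path('whotracksme/data/assets').rglob('*.csv'))[:5]:
    print(f'{csv_file}: {xz_ratio(csv_file):.0%} of original size')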

Small plotting tweaks to get website to build

Building the website today, I needed to make the following tweaks. I encountered three errors:

  • True is not a valid entry for height
  • Invalid color #00000000
  • autotick not a valid option

Here's the diff. I'm happy to submit as a PR if it's useful, but thought I would post first.

I have also included the output of pip freeze. I have higher version numbers for most packages compared to what's pinned in your requirements-dev.txt. I'm not sure why that happened; I used your instructions, pip install -e '.[dev]', to install requirements.

diff --git a/whotracksme/website/plotting/colors.py b/whotracksme/website/plotting/colors.py
index 1dbaf65..3613533 100644
--- a/whotracksme/website/plotting/colors.py
+++ b/whotracksme/website/plotting/colors.py
@@ -8,7 +8,7 @@ cliqz_colors = {
     "white": "#FFFFFF",
     "bright_gray": "#BFCBD6",
     "inactive_gray": "#BCC4CE",
-    "transparent": "#00000000",
+    "transparent": "rgba(0,0, 0, 0)",
     "green": "#50B1A2",
     "red": "#C3043E",
     "yellow": "#FFC802",
diff --git a/whotracksme/website/plotting/companies.py b/whotracksme/website/plotting/companies.py
index 6015ab7..4e58c77 100644
--- a/whotracksme/website/plotting/companies.py
+++ b/whotracksme/website/plotting/companies.py
@@ -6,7 +6,7 @@ from whotracksme.website.plotting.plots import scatter
 from whotracksme.website.plotting.colors import random_color, biggest_tracker_colors, cliqz_colors
 
 
-def overview_bars(companies, highlight=2, custom_height=True):
+def overview_bars(companies, highlight=2, height=None):
     x = []
     y = []
     colors = [cliqz_colors["purple"]] * highlight + [cliqz_colors["inactive_gray"]] * (len(companies) - highlight)
@@ -29,7 +29,7 @@ def overview_bars(companies, highlight=2, custom_height=True):
             margin=set_margins(t=30, l=150),
             showlegend=False,
             autosize=True,
-            height=custom_height if custom_height else None,
+            height=height,
             xaxis=dict(
                 color=cliqz_colors["gray_blue"],
                 tickformat="%",
diff --git a/whotracksme/website/plotting/trackers.py b/whotracksme/website/plotting/trackers.py
index 74ea953..4fe7fd5 100644
--- a/whotracksme/website/plotting/trackers.py
+++ b/whotracksme/website/plotting/trackers.py
@@ -133,7 +133,6 @@ def ts_trend(ts, t):
                 showgrid=False,
                 zeroline=False,
                 showline=False,
-                autotick=True,
                 hoverformat="%b %y",
                 ticks='',
                 showticklabels=False
@@ -143,7 +142,6 @@ def ts_trend(ts, t):
                 showgrid=False,
                 zeroline=False,
                 showline=False,
-                autotick=True,
                 ticks='',
                 showticklabels=False
             )
aiofiles==0.4.0
aiohttp==3.4.4
argh==0.26.2
async-timeout==3.0.1
atomicwrites==1.2.1
attrs==18.2.0
bleach==3.0.2
boto3==1.9.27
botocore==1.12.27
certifi==2018.10.15
cffi==1.11.5
chardet==3.0.4
cmarkgfm==0.4.2
colour==0.1.5
decorator==4.3.0
docopt==0.6.2
docutils==0.14
future==0.16.0
httptools==0.0.11
idna==2.7
ipython-genutils==0.2.0
Jinja2==2.10
jmespath==0.9.3
jsonschema==2.6.0
jupyter-core==4.4.0
libsass==0.15.1
Markdown==3.0.1
MarkupSafe==1.0
more-itertools==4.3.0
multidict==4.4.2
nbformat==4.4.0
numpy==1.15.2
pandas==0.23.4
pathtools==0.1.2
pkginfo==1.4.2
plotly==3.3.0
pluggy==0.8.0
py==1.7.0
pycparser==2.19
Pygments==2.2.0
pytest==3.9.1
python-dateutil==2.7.3
pytz==2018.5
PyYAML==3.13
readme-renderer==22.0
requests==2.20.0
requests-toolbelt==0.8.0
retrying==1.3.3
s3transfer==0.1.13
sanic==0.8.3
six==1.11.0
squarify==0.3.0
tqdm==4.27.0
traitlets==4.3.2
twine==1.12.1
ujson==1.35
urllib3==1.23
uvloop==0.11.2
watchdog==0.9.0
webencodings==0.5.1
websockets==5.0.1
-e git+https://github.com/cliqz-oss/whotracks.me.git@ecc99318a7323f4eb0c765c2412ddabdf3e2f633#egg=whotracksme
yarl==1.2.6

Filtering buttons for 'Presence on top sites' do not work properly

  1. Visit a page of any tracker
  2. In the 'Presence on top sites' section, click one of the buttons to filter the trackers by category, e.g. 'Adult'
  3. Click on the same button again

expected result: the filter for this category should be removed
actual result: nothing happens

RSS?

I want to keep up with the blog but I don't want to add it on any social media, RSS would be great. Could you add RSS feed? Thank you.

Correct classification of contentpass

I'm the CTO and co-founder of contentpass.

I believe that the current record about our service at https://whotracks.me/trackers/contentpass.html misrepresents what contentpass is actually doing and I would love to see the record corrected.

contentpass offers cross-publisher subscriptions as well as consent management for news publishers. Publishers can make use of our product by including a JavaScript library which is hosted on our domain contentpass.net

Part of our solution is a statistics endpoint that helps publishers measure the usage of our product. Our JavaScript library sends measurement-signals to the stats endpoint on sites where our solution is being used, the endpoint is located at: https://api.contentpass.net/stats

We believe our solution is especially interesting for privacy-aware users since, among other things, it allows users to support publishers by paying a monthly subscription fee in exchange for removal of banner ads and tracking on the participating publisher sites. We're currently in the rollout process of our service: while the cross-publisher subscriptions are not available to the public yet, we're already integrating with many publishers and we're planning public availability in the coming months. This is why our domains already appear in the whotracks.me database.

We have designed our service based on the principles of privacy-by-design and privacy-by-default. Among other things, we take the following measures to protect the privacy of users visiting publisher websites where our service is implemented:

  • we do not store any IP addresses (not even truncated IP addresses).
  • we do not collect any personally identifiable information (PII).
  • we do not collect any sort of unique user identifiers (uid) that would allow reconstruction of a browsing session.
  • we do not collect device information which would allow device fingerprinting (i.e. no screen resolution, no information about installed plugins, etc.).
  • we do not perform any cross-domain and/or 3rd-party tracking.

By the way, many of our design decisions were influenced by the "Data Collection without Privacy Side-Effects" Paper.

We're also adhering to the EFF Do Not Track (DNT) Policy as we've recently announced on our blog.

We believe that transparency is important, and we therefore value your efforts with the whotracks.me database. However, we also think that the information shown there should be correct and not misleading.

We would therefore like to ask for our information to be corrected:

  1. The record currently claims that we employ fingerprinting. This is not correct, we do not perform any fingerprinting, have never done so in the past and will never do it in the future. In fact, we do not even do anything that would qualify as "tracking" in the terms defined in the What is a tracker? blog post.
  2. We're currently categorized as "Advertising", however we're doing quite the opposite: We're offering subscriptions for ad-free and premium access to publisher websites, as well as consent management. We think that the category "Essential" would describe much better what we're doing.

If you need any additional information I'm happy to provide it here. I also hope that opening an issue was the correct way of addressing this (since #51 is still open).

You have been added to awesome-humane-tech

This is just an FYI issue to notify you that you were added to the curated awesome-humane-tech list in the 'Awareness' category and, if you like, are now entitled to wear our badge:

[badge: Awesome Humane Tech]

By adding this to the README:

[![Awesome Humane Tech](https://raw.githubusercontent.com/humanetech-community/awesome-humane-tech/main/humane-tech-badge.svg?sanitize=true)](https://github.com/humanetech-community/awesome-humane-tech)

https://github.com/humanetech-community/awesome-humane-tech

Expand tracker database fields

The informational fields on trackers should be expanded to provide richer information on each entity. Fields to be added:

  • Operating country
  • Privacy contact / Data protection officer
  • Description/in their own words: A short description of what the tracker/company does which can be displayed on the website.
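
For illustration, a tracker record with the proposed additions might look like the sketch below (field names are hypothetical, not the database's actual schema):

# hypothetical expanded tracker record -- field names illustrative only
tracker = {
    'id': 'example_tracker',
    'name': 'Example Tracker',
    'category': 'advertising',
    # proposed additions:
    'operating_country': 'DE',
    'privacy_contact': 'dpo@example.com',
    'description': 'In their own words: a short description of what the '
                   'tracker/company does, suitable for display on the website.',
}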

Missing dependency on requests

Was just doing a clean setup before making a PR, and I noticed that requests should be in the main requirements.txt, not requirements-dev.txt.

To reproduce:

# Make env
$ pip install whotracksme
$ python

Python 3.6.6 | packaged by conda-forge | (default, Oct 12 2018, 14:08:43)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from whotracksme.data.loader import DataSource             
Traceback (most recent call last):                                                                                                               
  File "<stdin>", line 1, in <module>        
  File "/home/bird/Dev/birdsarah/whotracks.me/whotracksme/data/__init__.py", line 2, in <module>
    from whotracksme.data.loader import (         
  File "/home/bird/Dev/birdsarah/whotracks.me/whotracksme/data/loader.py", line 7, in <module>
    from whotracksme.data.db import load_tracker_db, create_tracker_map
  File "/home/bird/Dev/birdsarah/whotracks.me/whotracksme/data/db.py", line 2, in <module>
    import requests                                  
ModuleNotFoundError: No module named 'requests'   

Referrer leak

When visiting a tracker's site from whotracks.me, the browser sends the referrer URL.
[screenshot: visiting twitter.com from whotracks.me]

The link has a rel="noreferrer" attribute, but the implementation looks broken in FF.

Can you try adding a referrer policy in <meta> tags, like <meta name="referrer" content="same-origin">?

According to https://bugzilla.mozilla.org/show_bug.cgi?id=530396, it should be honoured, but I am opening a separate bug with FF now.

Canvas fingerprinting warning

When whotracks.me is opened with privacy.resistFingerprinting or in Tor browser (with JS allowed), it throws a warning related to canvas fingerprinting.

[screenshots: canvas fingerprinting warnings in Cliqz and in the Tor Browser]

Pagination on websites page

Now that the site covers 3,500 websites, the websites listing page has grown to 3.5MB of HTML, making it very heavy to load. We should paginate this listing to make loading faster; a chunking sketch is below.
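
A minimal sketch of the chunking step, assuming the site generator has the full list of website entries in memory (names hypothetical):

# split the websites listing into fixed-size pages, each of which
# would be rendered as its own HTML file
def paginate(items, per_page=100):
    for page_number, start in enumerate(range(0, len(items), per_page), start=1):
        yield page_number, items[start:start + per_page]

# e.g. 3,500 entries -> 35 pages of 100
pages = list(paginate(list(range(3500))))
print(len(pages), len(pages[0][1]))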

Website categories

Hi guys, after working extensively with the Global and EU/US sites.csv datasets, I noticed wrongly categorized websites. This could be valuable to someone working with panel data from 2017 onward where website categories are important. I note the problematic websites and proposed recoded categories below for both datasets (a recoding sketch follows the lists).

Global sites' categories

  • nih.gov - categorized as Reference site until September 2018 release when it was properly categorized as Government; should be Government from the start.
  • targobank.de - categorized as Business site; should be Banking site.
  • ddl-warez.to - categorized as Recreation site; should be Entertainment site.

EU/US sites' categories

  • ca.gov - categorized as Reference site until September 2018 release when it was properly categorized as Government; should be Government from the start.
  • europa.eu - same as above.
  • nasa.gov - same as above.
  • nih.gov - same as above.
  • state.gov - same as above.
  • gov.uk - same as above.
  • irs.gov - categorized as Business site until September 2018 release when it was properly categorized as Government; should be Government from the start.
  • tax.service.gov.uk - same as above.
  • weather.gov - categorized as News and Portals site until September 2018 release when it was properly categorized as Government; should be Government from the start.
  • targobank.de - categorized as Business site; should be Banking site.
  • audible.com categorized as Entertainment; should be E-commerce.
  • discover.com categorized as News and Portals; should be Banking site.
  • ddl-warez.to - categorized as Recreation site; should be Entertainment site.
  • yelp.de categorized as Entertainment; should be Reference.
  • stylight.de categorized as Reference; should be E-commerce.
  • linguee.de categorized as News and Portals; should be Reference.
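
A minimal sketch of applying the recodings to a loaded sites table, assuming pandas and that the frame has 'site' and 'category' columns (column names hypothetical):

import pandas as pd

# proposed recodings from the lists above (extend as needed)
recodings = {
    'nih.gov': 'Government',
    'targobank.de': 'Banking',
    'ddl-warez.to': 'Entertainment',
    'audible.com': 'E-commerce',
    'discover.com': 'Banking',
    'yelp.de': 'Reference',
    'stylight.de': 'E-commerce',
    'linguee.de': 'Reference',
}

sites = pd.read_csv('sites.csv')  # path illustrative
sites['category'] = sites['site'].map(recodings).fillna(sites['category'])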

Update .sql data

How come the SQL database is still from July 2020 data when the repo has data up to Dec 2020?

Data set not available?

Hey there, I was looking for the whotracks.me data set for research at my university, but instead of the data I get this text:

version https://git-lfs.github.com/spec/v1
oid sha256:42f43921beed49e843db30eeeec308209e71669c4a3c650f101fc35596063519
size 219147

On Wednesday it was still available... Could you help me with that issue?

Thanks and greetings
Lena
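
For reference, that text is a Git LFS pointer file rather than the CSV itself; the real content has to be fetched via Git LFS. A minimal sketch (assuming a local checkout) for spotting files that are still pointers:

from pathlib import Path

# a Git LFS pointer file starts with this marker instead of CSV data
LFS_MARKER = b'version https://git-lfs.github.com/spec/'

def is_lfs_pointer(path: Path) -> bool:
    with open(path, 'rb') as f:
        return f.read(len(LFS_MARKER)) == LFS_MARKER

for f in Path('whotracksme/data/assets').rglob('*.csv'):
    if is_lfs_pointer(f):
        print(f)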

Inconsistency in Company Name field

While looking at assets/trackerdb.sql, I noticed that in the trackers table the company names sometimes appear with quotes, like "company name", and sometimes without. Is there a reason for this?

name
"Google Analytics"
DoubleClick
Google
"Google APIs"
"Google Tag Manager"
Facebook
INFOnline
"Google AdServices"
"Google Syndication"
"Amazon Web Services"

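Incidentally, the quoted names in the list above all contain spaces while the unquoted ones do not, which suggests CSV-style quoting applied when the dump was generated. A minimal normalising sketch, hedged since the dump format hasn't been confirmed:

def normalize_name(raw: str) -> str:
    # strip one pair of surrounding double quotes, if present
    if len(raw) >= 2 and raw[0] == raw[-1] == '"':
        return raw[1:-1]
    return raw

assert normalize_name('"Google Analytics"') == 'Google Analytics'
assert normalize_name('DoubleClick') == 'DoubleClick'
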
Duplicate E-commerce labels

[screenshot: duplicated E-commerce label, 2019-07-18]

When you visit a tracker detail page like https://whotracks.me/trackers/yieldlove.html#Entertainment, there is a section that displays the categories of websites the tracker appears on.

The label E-Commerce is duplicated.

Clarification on how to get a list of domains associated with fingerprinting

I've been working through the various data options trying to compile a list of domains that classify as fingerprinting. I'm getting mixed results and am wondering if you can clarify what you think of as the canonical approach.

Apologies if I'm just misreading the documentation. I'm happy to submit a PR to docs if you think it would be useful.

I can use the data source and get a list of tracker IDs as follows:

from whotracksme.data.loader import DataSource

# collect ids of trackers whose fingerprinting signal (bad_qs) exceeds
# 10% in any region
fp_trackers = set()
regions = {'de', 'eu', 'fr', 'global', 'us'}
for region in regions:
    who_tracks_data = DataSource(region=region)
    who_tracks_fp = who_tracks_data.trackers.df[who_tracks_data.trackers.df.bad_qs > 0.1]
    fp_trackers.update(list(who_tracks_fp.tracker.values))

This gives me 193 trackers. I can then map this to domains using the map from create_tracker_map.

# tracker_info is the map produced by create_tracker_map
# (whotracksme.data.db), keyed by tracker id
could_not_find = []
domains = set()
for tracker in fp_trackers:
    try:
        domains.update(tracker_info['trackers'][tracker]['domains'])
    except KeyError:
        could_not_find.append(tracker)

This will give me 326 domains.

If I take a different route and read in all the csv files named domains.csv under the assets folders, I can get a list of domains like this:

import pandas as pd

domains_df = pd.concat([
    pd.read_csv(file, parse_dates=['month'])
    for file in asset_paths['domains']  # I have previously assembled all the paths
])
fingerprinting_trackers = domains_df[domains_df.bad_qs > 0.1].host_tld.unique()

But this gives me a list of 292 domains.

I can think of an explanation for this: not every host_tld need have a bad_qs that meets the threshold itself; some may have been added to the tracker map for other reasons. The discrepancy can be inspected directly, as sketched below.
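
A minimal sketch, reusing the domains and fingerprinting_trackers names from the snippets above:

# domains that appear in only one of the two lists
map_only = domains - set(fingerprinting_trackers)
csv_only = set(fingerprinting_trackers) - domains
print(f'{len(map_only)} only in the tracker map, {len(csv_only)} only in domains.csv')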

However, given that the other csv files may also be relevant, I was starting to lose confidence and so wanted to check in.

Many thanks in advance for your help.

KeyError

After:

from whotracksme.data.loader import DataSource
data = DataSource()

I get:

data available for months:
├── 2017-05
├── 2017-06
├── 2017-07
├── 2017-08
├── 2017-09
├── 2017-10
├── 2017-11
├── 2017-12
├── 2018-01
├── 2018-02
├── 2018-03
├── 2018-04
├── 2018-05
├── 2018-06
├── 2018-07
├── 2018-08
├── 2018-09
├── 2018-10
├── 2018-11
├── 2018-12
├── 2019-01
├── 2019-02
├── 2019-03
├── 2019-04
├── 2019-05
├── 2019-06
├── 2019-07
├── 2019-08
├── 2019-09
├── 2019-10
├── 2019-11
├── 2019-12
load trackers
update/create data for 2017-05/global/trackers.csv
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kato/Applications/whotracks.me/whotracksme/data/loader.py", line 60, in __init__
    populate=populate,
  File "/home/kato/Applications/whotracks.me/whotracksme/data/loader.py", line 189, in __init__
    self.db.load_data('trackers', self.region, month)
  File "/home/kato/Applications/whotracks.me/whotracksme/data/db.py", line 312, in load_data
    [row[col] for col in name_columns] + \
KeyError: 'month'
