whotracks.me
Data from the largest and longest measurement of online tracking.
Home Page: https://whotracks.me
License: MIT License
Hi guys,
I'm receiving the following error message after cloning this repo: "This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access."
I'm attaching the log file here: 20201102T112956.908151.log
The above error persists even after running git lfs fetch.
Basically, all the files in the whotracksme/data/assets/... folders are not converted from their LFS pointers to real .csv files.
How can one access the raw .csv files that were once available in the assets/data/... folders?
If a 404 is triggered on a page not at the root of the site, the stylesheets will not load correctly.
https://whotracks.me/something renders correctly, but https://whotracks.me/tracker/something does not. This is because we address the static directory with a relative path, on the assumption that the page will always be loaded at the site root.
When a tracker/website is mentioned in a blog post, we could auto-generate a section on the tracker page 'Posts mentioning this page'.
Use case:
Hi,
I started to look at the trackers data and ran into some observations that made me wonder if there have been changes since 2019 to how this data is collected.
On the home page, in the section "The most tracked websites", there is the option to display by "traffic" or "Average number of trackers". These two currently display exactly the same content (which is average number of trackers).
We should show something different under "traffic", or not show it at all.
Use rel="noreferrer" on links to external sites.
There seems to be an issue with whotracksme when installed from PyPI: the data is not packaged with the code. It would also be nice to automatically publish a new version to PyPI every time the data is updated; this should be possible from Travis.
Hi,
I have a query about the licensing terms of this project. The licensing description reads:
"The content of this project itself is licensed under the Creative Commons Attribution 4.0 license, and the underlying source code used to generate and display that content is licensed under the MIT license."
Q1. Does it mean that all the data, including third party/tracker information present at https://github.com/cliqz-oss/whotracks.me/tree/master/whotracksme/data/assets, are covered under Creative Common Attribution 4.0 license?
Q2. Can I also use the data present here for some experiments in my personal project?
Regards,
Ethan
Hi!
I've just found out that Adguard is listed as a tracker on whotracksme: https://whotracks.me/trackers/adguard.html
This is not quite true, but I can see where it comes from. Let me please clarify the situation.
The script tag
<script src="https://local.adguard.com/blahblah/content-script.js">
is what takes care of cosmetic rules. Requests to local.adguard.com are intercepted by the network driver and processed locally. Also, we changed the domain to local.adguard.org in the newer versions.
What's important here:
Hi!
I installed the project from pip, and after
data = DataSource()
shows:
data available for months: ['2017-05', '2017-06', '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12', '2018-01', '2018-02', '2018-03', '2018-04']
Is there a way to download the latest db, or do I just have to download the db manually and then load the data?
I tried to install whotracks.me from source, but there were library version conflicts.
How to reproduce:
$ git clone https://github.com/ghostery/whotracks.me.git
$ cd whotracks.me
$ conda create -n whotracksme python=3.8
$ conda activate whotracksme
$ pip install -r requirements.txt
...
ERROR: Cannot install -r requirements.txt (line 9) and urllib3==1.26.5 because these package versions have conflicting dependencies.
The conflict is caused by:
The user requested urllib3==1.26.5
requests 2.24.0 depends on urllib3!=1.25.0, !=1.25.1, <1.26 and >=1.21.1
My system: Ubuntu 18.04.
A quick fix: update the version of requests in requirements.txt:
--- a/requirements.txt
+++ b/requirements.txt
@@ -6,6 +6,6 @@ numpy==1.19.1
pandas==1.1.2
python-dateutil==2.8.1
pytz==2020.1
-requests==2.24.0
+requests==2.25.1
six==1.15.0
urllib3==1.26.5
whotracks.me has a 404 page, for example:
https://whotracks.me/sadadas
But it breaks on URL structures like:
https://whotracks.me/invalid/sadadas
https://whotracks.me/websites/lufthansa.com
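The likely root cause is relative static/... references in the shared templates; making them root-relative fixes pages at any depth. A minimal sketch of such a rewrite (the regex and helper name are mine, not the site's actual code):

```python
import re

def absolutize_static(html, prefix="/static/"):
    """Rewrite relative href="static/..." and src="static/..." references
    to root-relative ones, so nested pages like /tracker/foo resolve them."""
    return re.sub(r'((?:href|src)=")static/', r"\g<1>" + prefix, html)
```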
Currently, we exceed our Git LFS limits relatively quickly. Some ideas to reduce the amount of downloaded data:
- compress the assets (the compression ratio with xz is 20%-25%)

Building the website today, I needed to make the following tweaks to get it to build. I encountered three errors:
- the height keyword in overview_bars
- #00000000 is not a valid color
- autotick is not a valid option

Here's the diff. I'm happy to submit it as a PR if it's useful, but thought I would post first.
I have also included the output of pip freeze. I have higher version numbers for most packages compared to what's pinned in your requirements-dev.txt. I'm not sure why that happened; I used your instructions (pip install -e '.[dev]') to install the requirements.
diff --git a/whotracksme/website/plotting/colors.py b/whotracksme/website/plotting/colors.py
index 1dbaf65..3613533 100644
--- a/whotracksme/website/plotting/colors.py
+++ b/whotracksme/website/plotting/colors.py
@@ -8,7 +8,7 @@ cliqz_colors = {
"white": "#FFFFFF",
"bright_gray": "#BFCBD6",
"inactive_gray": "#BCC4CE",
- "transparent": "#00000000",
+ "transparent": "rgba(0,0, 0, 0)",
"green": "#50B1A2",
"red": "#C3043E",
"yellow": "#FFC802",
diff --git a/whotracksme/website/plotting/companies.py b/whotracksme/website/plotting/companies.py
index 6015ab7..4e58c77 100644
--- a/whotracksme/website/plotting/companies.py
+++ b/whotracksme/website/plotting/companies.py
@@ -6,7 +6,7 @@ from whotracksme.website.plotting.plots import scatter
from whotracksme.website.plotting.colors import random_color, biggest_tracker_colors, cliqz_colors
-def overview_bars(companies, highlight=2, custom_height=True):
+def overview_bars(companies, highlight=2, height=None):
x = []
y = []
colors = [cliqz_colors["purple"]] * highlight + [cliqz_colors["inactive_gray"]] * (len(companies) - highlight)
@@ -29,7 +29,7 @@ def overview_bars(companies, highlight=2, custom_height=True):
margin=set_margins(t=30, l=150),
showlegend=False,
autosize=True,
- height=custom_height if custom_height else None,
+ height=height,
xaxis=dict(
color=cliqz_colors["gray_blue"],
tickformat="%",
diff --git a/whotracksme/website/plotting/trackers.py b/whotracksme/website/plotting/trackers.py
index 74ea953..4fe7fd5 100644
--- a/whotracksme/website/plotting/trackers.py
+++ b/whotracksme/website/plotting/trackers.py
@@ -133,7 +133,6 @@ def ts_trend(ts, t):
showgrid=False,
zeroline=False,
showline=False,
- autotick=True,
hoverformat="%b %y",
ticks='',
showticklabels=False
@@ -143,7 +142,6 @@ def ts_trend(ts, t):
showgrid=False,
zeroline=False,
showline=False,
- autotick=True,
ticks='',
showticklabels=False
)
aiofiles==0.4.0
aiohttp==3.4.4
argh==0.26.2
async-timeout==3.0.1
atomicwrites==1.2.1
attrs==18.2.0
bleach==3.0.2
boto3==1.9.27
botocore==1.12.27
certifi==2018.10.15
cffi==1.11.5
chardet==3.0.4
cmarkgfm==0.4.2
colour==0.1.5
decorator==4.3.0
docopt==0.6.2
docutils==0.14
future==0.16.0
httptools==0.0.11
idna==2.7
ipython-genutils==0.2.0
Jinja2==2.10
jmespath==0.9.3
jsonschema==2.6.0
jupyter-core==4.4.0
libsass==0.15.1
Markdown==3.0.1
MarkupSafe==1.0
more-itertools==4.3.0
multidict==4.4.2
nbformat==4.4.0
numpy==1.15.2
pandas==0.23.4
pathtools==0.1.2
pkginfo==1.4.2
plotly==3.3.0
pluggy==0.8.0
py==1.7.0
pycparser==2.19
Pygments==2.2.0
pytest==3.9.1
python-dateutil==2.7.3
pytz==2018.5
PyYAML==3.13
readme-renderer==22.0
requests==2.20.0
requests-toolbelt==0.8.0
retrying==1.3.3
s3transfer==0.1.13
sanic==0.8.3
six==1.11.0
squarify==0.3.0
tqdm==4.27.0
traitlets==4.3.2
twine==1.12.1
ujson==1.35
urllib3==1.23
uvloop==0.11.2
watchdog==0.9.0
webencodings==0.5.1
websockets==5.0.1
-e git+https://github.com/cliqz-oss/whotracks.me.git@ecc99318a7323f4eb0c765c2412ddabdf3e2f633#egg=whotracksme
yarl==1.2.6
expected result: the filter for this category should be removed
actual result: nothing happens
We should link to this repo from the site.
I want to keep up with the blog, but I don't want to add it on any social media; RSS would be great. Could you add an RSS feed? Thank you.
I'm the CTO and co-founder of contentpass.
I believe that the current record about our service at https://whotracks.me/trackers/contentpass.html misrepresents what contentpass is actually doing and I would love to see the record corrected.
contentpass offers cross-publisher subscriptions as well as consent management for news publishers. Publishers can make use of our product by including a JavaScript library which is hosted on our domain contentpass.net.
Part of our solution is a statistics endpoint that helps publishers measure the usage of our product. Our JavaScript library sends measurement signals to the stats endpoint on sites where our solution is in use; the endpoint is located at https://api.contentpass.net/stats.
We believe our solution is especially interesting for privacy-aware users since, among other things, it allows users to support publishers by paying a monthly subscription fee in exchange for the removal of banner ads and tracking on the participating publisher sites. We're currently in the rollout process of our service: while the cross-publisher subscriptions are not available to the public yet, we're already integrating with many publishers and are planning public availability within the next few months. This is why our domains already appear in the whotracks.me database.
We have designed our service based on the principles of privacy-by-design and privacy-by-default. Among others, we take the following measures to protect the privacy of users visiting publisher websites where our service is implemented:
By the way, many of our design decisions were influenced by the "Data Collection without Privacy Side-Effects" Paper.
We're also adhering to the EFF Do Not Track (DNT) Policy as we've recently announced on our blog.
We believe that transparency is important, and we therefore value your efforts with the whotracks.me database. However, we also think that the information shown there should be correct and not misleading.
We would therefore like to ask for our information to be corrected:
If you need any additional information I'm happy to provide it here. I also hope that opening an issue was the correct way of addressing this (since #51 is still open).
This is just an FYI issue to let you know that you were added to the curated awesome-humane-tech list in the 'Awareness' category and, if you like, are now entitled to wear our badge:
By adding this to the README:
[![Awesome Humane Tech](https://raw.githubusercontent.com/humanetech-community/awesome-humane-tech/main/humane-tech-badge.svg?sanitize=true)](https://github.com/humanetech-community/awesome-humane-tech)
Put the build date on tracker and website pages so the freshness of the data is indicated.
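One lightweight way to do this is to stamp the generation time into the template context when the site is built; a sketch (the function name and date format are assumptions, not the site's conventions):

```python
from datetime import datetime, timezone

def build_date_string(now=None):
    """Human-readable build stamp for page footers,
    e.g. 'Data generated on 03 May 2021 (UTC)'."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("Data generated on %d %B %Y (UTC)")
```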
The idea is to list partner networks third parties are part of, and do this for as many third parties as possible.
Eg: Acxiom Case
Feel free to use this format for new entries in the comments below.
Hi there! We've been accumulating some information on known trackers, and I guess it might be useful to you.
I tried to make a proper pull request at first, but it's not that easy to convert what we have into your format :) Please take a look at the doc; I've marked the trackers that are missing or incomplete on whotracks.me.
https://docs.google.com/spreadsheets/d/19yJoE2UQ3eh1Gd26YMtWASBp7gh0xOBsNZr3xouggdI/edit?usp=sharing
The informational fields on trackers should be expanded to provide richer information on each entity. Fields to be added:
Was just doing a clean setup before making a PR and I noticed that requests should be in the main requirements.txt, not requirements-dev.txt.
To reproduce:
# Make env
$ pip install whotracksme
$ python
Python 3.6.6 | packaged by conda-forge | (default, Oct 12 2018, 14:08:43)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from whotracksme.data.loader import DataSource
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/bird/Dev/birdsarah/whotracks.me/whotracksme/data/__init__.py", line 2, in <module>
from whotracksme.data.loader import (
File "/home/bird/Dev/birdsarah/whotracks.me/whotracksme/data/loader.py", line 7, in <module>
from whotracksme.data.db import load_tracker_db, create_tracker_map
File "/home/bird/Dev/birdsarah/whotracks.me/whotracksme/data/db.py", line 2, in <module>
import requests
ModuleNotFoundError: No module named 'requests'
When visiting a tracker's site from whotracks.me, the referrer URL is sent. The links have a rel="noreferrer" attribute, but the implementation looks broken in Firefox.
Could you try adding a referrer policy in a <meta> tag, like <meta name="referrer" content="same-origin">?
According to https://bugzilla.mozilla.org/show_bug.cgi?id=530396, it should be respected, but I am opening a separate bug with Firefox now.
Should probably add a link to the Twitter account somewhere in the page.
Handle errors with custom templates.
For a lot of the domains mentioned in the trackers data, we need a way to find the HTTPS equivalent.
Eg:
https://whotracks.me/trackers/google_analytics.html
https://whotracks.me/trackers/bluekai.html
https://whotracks.me/trackers/wordpress_stats.html#
WordPress Stats is used via Jetpack. The only changes I think need to be made are updating 'Wordpress' to 'WordPress' and linking to jetpack.com.
https://github.com/cliqz-oss/whotracks.me/blob/master/contrib/generating_adblocker_filters.py has the following import, which is no longer available: from whotracksme.data import load_apps
With 3,500 websites now on the site, the websites listing page has grown to 3.5 MB of HTML, making it very heavy to load. We should paginate this listing to make loading faster.
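At build time the fix could be as simple as chunking the website list and emitting one HTML page per chunk; a sketch (the page size of 250 is an arbitrary choice, not a decided value):

```python
def paginate(items, per_page=250):
    """Split `items` into consecutive pages of at most `per_page` entries."""
    return [items[i:i + per_page] for i in range(0, len(items), per_page)]
```

With roughly 3,500 sites and 250 per page, that would yield 14 listing pages instead of one 3.5 MB page.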
Hi guys, after working extensively with the Global and EU/US sites.csv datasets, I noticed wrongly categorized websites. This could be valuable to someone working with panel data from 2017 onward where website categories are important. I note the problematic websites and proposed recoded categories below for both datasets.

- nih.gov - categorized as a Reference site until the September 2018 release, when it was properly categorized as Government; should be Government from the start.
- targobank.de - categorized as a Business site; should be a Banking site.
- ddl-warez.to - categorized as a Recreation site; should be an Entertainment site.
- ca.gov - categorized as a Reference site until the September 2018 release, when it was properly categorized as Government; should be Government from the start.
- europa.eu - same as above.
- nasa.gov - same as above.
- nih.gov - same as above.
- state.gov - same as above.
- gov.uk - same as above.
- irs.gov - categorized as a Business site until the September 2018 release, when it was properly categorized as Government; should be Government from the start.
- tax.service.gov.uk - same as above.
- weather.gov - categorized as a News and Portals site until the September 2018 release, when it was properly categorized as Government; should be Government from the start.
- targobank.de - categorized as a Business site; should be a Banking site.
- audible.com - categorized as Entertainment; should be E-commerce.
- discover.com - categorized as News and Portals; should be a Banking site.
- ddl-warez.to - categorized as a Recreation site; should be an Entertainment site.
- yelp.de - categorized as Entertainment; should be Reference.
- stylight.de - categorized as Reference; should be E-commerce.
- linguee.de - categorized as News and Portals; should be Reference.

How come the SQL database is still from July 2020 data when the repo has data up to December 2020?
Hey there, I was looking for the whotracks.me data set for university research, but instead of the data I get this text:
version https://git-lfs.github.com/spec/v1
oid sha256:42f43921beed49e843db30eeeec308209e71669c4a3c650f101fc35596063519
size 219147
On Wednesday it was still available... Could you help me with this issue?
Thanks and greetings
Lena
While looking at assets/trackerdb.sql, I noticed that in the trackers table the company names sometimes appear with quotes, like "company name", and at other times without. Is there a reason for this?

name
"Google Analytics"
DoubleClick
Google
"Google APIs"
"Google Tag Manager"
Facebook
INFOnline
"Google AdServices"
"Google Syndication"
"Amazon Web Services"
There was a recent article about websites tracking consumer scores and how to request your data from them.
Do you know if there is any repo listing websites such as those? When Googling the closest thing I found right away was this repo, which has quite a different purpose.
The pages https://whotracks.me/blog.html and https://whotracks.me/blog/private_analytics.html have broken images when visited with uBlock Origin enabled.
According to the uBlock Origin logs, the following rule breaks the image (screenshot attached):
13:44:07 | /analytics/analytics.$image | -- | image | https://whotracks.me/static/img/blog/analytics/analytics.png
I've been working through the various data options, trying to compile a list of domains that classify as fingerprinting. I'm getting mixed results and am wondering if you can clarify what you consider the canonical approach.
Apologies if I'm just misreading the documentation. I'm happy to submit a PR to the docs if you think it would be useful.
I can use the data source and get a list of tracker ids as follows:
fp_trackers = set()
regions = {'de', 'eu', 'fr', 'global', 'us'}
for region in regions:
    who_tracks_data = DataSource(region=region)
    who_tracks_fp = who_tracks_data.trackers.df[who_tracks_data.trackers.df.bad_qs > 0.1]
    fp_trackers.update(list(who_tracks_fp.tracker.values))
This gives me 193 trackers. I can then map these to domains using the map from create_tracker_map.
could_not_find = []
domains = set()
for tracker in fp_trackers:
    try:
        domains.update(tracker_info['trackers'][tracker]['domains'])
    except KeyError:
        could_not_find.append(tracker)
This will give me 326 domains.
If I take a different route and read in all the CSV files under the assets folders named domains.csv, I can get a list of domains like this:
domains_df = pd.concat([
    pd.read_csv(file, parse_dates=['month'])
    for file in asset_paths['domains']  # I have previously assembled all the paths
])
fingerprinting_trackers = domains_df[domains_df.bad_qs > 0.1].host_tld.unique()
But this gives me a list of 292 domains.
I can think of an explanation for this: not every host_tld may have a bad_qs that meets the threshold, yet it may have been added to the tracker map for other reasons.
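That hypothesis can be checked directly with a set difference between the two domain lists; a toy sketch (the names are placeholders, not the repo's API):

```python
def compare_domain_sets(map_domains, csv_domains):
    """Return (only_in_map, only_in_csv): domains present in the tracker
    map but not in the threshold-filtered domains.csv data, and vice versa."""
    return map_domains - csv_domains, csv_domains - map_domains

# only_in_map would hold hosts added to the map for reasons other than
# bad_qs; only_in_csv would hold hosts not attributed to any tracker.
```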
However, given that the other csv files may also be relevant, I was starting to lose confidence and so wanted to check in.
Many thanks in advance for your help.
After:
from whotracksme.data.loader import DataSource
data = DataSource()
I get:
data available for months:
├── 2017-05
├── 2017-06
├── 2017-07
├── 2017-08
├── 2017-09
├── 2017-10
├── 2017-11
├── 2017-12
├── 2018-01
├── 2018-02
├── 2018-03
├── 2018-04
├── 2018-05
├── 2018-06
├── 2018-07
├── 2018-08
├── 2018-09
├── 2018-10
├── 2018-11
├── 2018-12
├── 2019-01
├── 2019-02
├── 2019-03
├── 2019-04
├── 2019-05
├── 2019-06
├── 2019-07
├── 2019-08
├── 2019-09
├── 2019-10
├── 2019-11
├── 2019-12
load trackers
update/create data for 2017-05/global/trackers.csv
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/kato/Applications/whotracks.me/whotracksme/data/loader.py", line 60, in __init__
populate=populate,
File "/home/kato/Applications/whotracks.me/whotracksme/data/loader.py", line 189, in __init__
self.db.load_data('trackers', self.region, month)
File "/home/kato/Applications/whotracks.me/whotracksme/data/db.py", line 312, in load_data
[row[col] for col in name_columns] + \
KeyError: 'month'