
whotracks.me's People

Contributors

birdsarah, chrmod, dbalan, dependabot[bot], ecnmst, humera-cliqz, ilaria-cliqz, karlolukic, konarkmodi, mdsandu, orenyomtov, philipp-classen, remusao, sammacbeth, smalluban, valmikkpatel, y3ti


whotracks.me's Issues

Downloading data / cloning a repository

Hi guys,

I'm receiving the following error message after cloning this repo: "This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access."

I'm attaching the log file here: 20201102T112956.908151.log

The above error persists even after running: git lfs fetch

Basically, none of the files in the whotracksme/data/assets/... folders are converted from their LFS pointers to real .csv files.

How can one access the raw .csv files that were once available in the assets/data/... folders?

Changes in data collection since 2019?

Hi,
I started to look at the trackers data and made some observations that made me wonder whether there have been changes since 2019 in how this data is collected.

  1. There seems to be missing data for 'has_blocking' in the trackers table from Jan 2019 to May 2019. Was there a change in instrumentation or user setup during this time? Is it OK to use the remaining data from this period?
  2. I noticed that the use of cookies (averaged over all trackers per month) is steadily decreasing from 2019 to 2021, which isn't intuitive to me (see the sketch after this list). Has this been observed/understood? If there is any literature on this, could you please point me to it?
  3. Is it possible to obtain the number of users or page loads from each country each month?

Thanks!
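
A minimal sketch of the computation in point 2, assuming the whotracksme package and that the trackers table exposes 'month' and 'cookies' columns (the column name may differ between releases):

from whotracksme.data.loader import DataSource

# average cookie usage across all trackers, per month
data = DataSource(region='global')
df = data.trackers.df
monthly_cookie_use = df.groupby('month')['cookies'].mean()
print(monthly_cookie_use)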

"most tracked websites" tables are identical

On the home page, in the section "The most tracked websites", there is the option to display by "traffic" or "Average number of trackers". These two currently display exactly the same content (which is average number of trackers).

We should show something different under "traffic", or not show it at all.

Data does not seem to be shipped with whotracksme from pypi

There seems to be an issue with whotracksme when installed from pypi: the data is not packaged with the code. It would also be nice to automatically publish a new version on pypi every time the data is updated; this should be possible from travis. A packaging sketch is below.
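
A minimal sketch of one way to ship the data with the package, assuming setuptools and the repository's assets layout (the glob patterns are illustrative, not the project's actual setup.py):

from setuptools import setup, find_packages

setup(
    name='whotracksme',
    packages=find_packages(),
    include_package_data=True,
    package_data={
        # ship the monthly per-region CSVs (e.g. 2017-05/global/trackers.csv)
        # and the tracker database with the package
        'whotracksme.data': ['assets/*/*/*.csv', 'assets/trackerdb.sql'],
    },
)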

Need clarity on licensing terms

Hi,

I have a query related to the licensing terms mentioned in this project. As licensing description is mentioned as below:
"The content of this project itself is licensed under the Creative Commons Attribution 4.0 license, and the underlying source code used to generate and display that content is licensed under the MIT license."

Q1. Does it mean that all the data, including third party/tracker information present at https://github.com/cliqz-oss/whotracks.me/tree/master/whotracksme/data/assets, are covered under Creative Common Attribution 4.0 license?

Q2. Can I also use the data present here for some experiments in my personal project?

Regards,
Ethan

Adguard clarification

Hi!

I've just found out that Adguard is listed as a tracker on whotracksme: https://whotracks.me/trackers/adguard.html

This is not quite true, but I can see where it comes from. Let me please clarify the situation.

  1. AdGuard for Windows/Mac is a network-level content blocker, so it cannot simply add custom JS/CSS to webpages the way browser extensions do.
  2. In order to do that, it injects a content script: <script src="https://local.adguard.com/blahblah/content-script.js"> that takes care of cosmetic rules.
  3. Connections to local.adguard.com are intercepted by the network driver and processed locally. Also, we changed the domain to local.adguard.org in the newer versions.
  4. This is a usual approach for network-level software. For instance, you have Kaspersky listed as a tracker because of the very same thing -- they add a content script to every page.

What's important here:

  1. There are no remote connections; everything is processed locally.
  2. There is no tracking, fingerprinting, or anything of the sort.

Database update

Hi!
I installed the project from pip, and after

data = DataSource()

shows:

data available for months: ['2017-05', '2017-06', '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12', '2018-01', '2018-02', '2018-03', '2018-04']
Is there a way to download the latest data, or do I have to manually download the database and then load it?

Library version conflicts when installing from source

I tried to install whotracks.me from source, but there were library version conflicts.
How to reproduce:

 $  git clone https://github.com/ghostery/whotracks.me.git
 $  cd whotracks.me
 $  conda create -n whotracksme python=3.8
 $  conda activate whotracksme
 $  pip install -r requirements.txt
...
ERROR: Cannot install -r requirements.txt (line 9) and urllib3==1.26.5 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested urllib3==1.26.5
    requests 2.24.0 depends on urllib3!=1.25.0, !=1.25.1, <1.26 and >=1.21.1

My system: Ubuntu 18.04.

A quick fix: update the version of requests in requirements.txt:

--- a/requirements.txt
+++ b/requirements.txt
@@ -6,6 +6,6 @@ numpy==1.19.1
 pandas==1.1.2
 python-dateutil==2.8.1
 pytz==2020.1
-requests==2.24.0
+requests==2.25.1
 six==1.15.0
 urllib3==1.26.5

Conserve Git LFS bandwidth

Currently, we exceed our Git LFS limits relatively quickly. Some ideas to reduce the amount of downloaded data:

  • By default, download only the most recent data (configure lfs.fetchinclude / lfs.fetchexclude, see git-lfs/git-lfs#2717). New data per month takes around 250M, while the whole data set is currently around 6.7G (and will grow with each month).
  • Compress the csv files (the expected compression ratio with xz is 20%-25%; a quick check is sketched below).
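
A minimal sketch for checking the expected xz ratio, assuming a local checkout with the CSVs under whotracksme/data/assets:

import lzma
from pathlib import Path

# report the xz (LZMA) compressed size of a few asset CSVs as a
# fraction of their raw size
def xz_ratio(path: Path) -> float:
    raw = path.read_bytes()
    return len(lzma.compress(raw, preset=9)) / len(raw)

for csv_file in sorted(Path('whotracksme/data/assets').rglob('*.csv'))[:5]:
    print(f'{csv_file}: {xz_ratio(csv_file):.0%} of original size')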

Small plotting tweaks to get website to build

Building the website today, I needed to make the following tweaks. I encountered three errors:

  • True is not a valid entry for height
  • Invalid color #00000000
  • autotick not a valid option

Here's the diff. I'm happy to submit as a PR if it's useful, but thought I would post first.

I have also included the output of pip freeze. I have higher version numbers for most packages compared to what's pinned in your requirements-dev.txt. I'm not sure why that happened; I used your instructions, pip install -e '.[dev]', to install requirements.

diff --git a/whotracksme/website/plotting/colors.py b/whotracksme/website/plotting/colors.py
index 1dbaf65..3613533 100644
--- a/whotracksme/website/plotting/colors.py
+++ b/whotracksme/website/plotting/colors.py
@@ -8,7 +8,7 @@ cliqz_colors = {
     "white": "#FFFFFF",
     "bright_gray": "#BFCBD6",
     "inactive_gray": "#BCC4CE",
-    "transparent": "#00000000",
+    "transparent": "rgba(0,0, 0, 0)",
     "green": "#50B1A2",
     "red": "#C3043E",
     "yellow": "#FFC802",
diff --git a/whotracksme/website/plotting/companies.py b/whotracksme/website/plotting/companies.py
index 6015ab7..4e58c77 100644
--- a/whotracksme/website/plotting/companies.py
+++ b/whotracksme/website/plotting/companies.py
@@ -6,7 +6,7 @@ from whotracksme.website.plotting.plots import scatter
 from whotracksme.website.plotting.colors import random_color, biggest_tracker_colors, cliqz_colors
 
 
-def overview_bars(companies, highlight=2, custom_height=True):
+def overview_bars(companies, highlight=2, height=None):
     x = []
     y = []
     colors = [cliqz_colors["purple"]] * highlight + [cliqz_colors["inactive_gray"]] * (len(companies) - highlight)
@@ -29,7 +29,7 @@ def overview_bars(companies, highlight=2, custom_height=True):
             margin=set_margins(t=30, l=150),
             showlegend=False,
             autosize=True,
-            height=custom_height if custom_height else None,
+            height=height,
             xaxis=dict(
                 color=cliqz_colors["gray_blue"],
                 tickformat="%",
diff --git a/whotracksme/website/plotting/trackers.py b/whotracksme/website/plotting/trackers.py
index 74ea953..4fe7fd5 100644
--- a/whotracksme/website/plotting/trackers.py
+++ b/whotracksme/website/plotting/trackers.py
@@ -133,7 +133,6 @@ def ts_trend(ts, t):
                 showgrid=False,
                 zeroline=False,
                 showline=False,
-                autotick=True,
                 hoverformat="%b %y",
                 ticks='',
                 showticklabels=False
@@ -143,7 +142,6 @@ def ts_trend(ts, t):
                 showgrid=False,
                 zeroline=False,
                 showline=False,
-                autotick=True,
                 ticks='',
                 showticklabels=False
             )
aiofiles==0.4.0
aiohttp==3.4.4
argh==0.26.2
async-timeout==3.0.1
atomicwrites==1.2.1
attrs==18.2.0
bleach==3.0.2
boto3==1.9.27
botocore==1.12.27
certifi==2018.10.15
cffi==1.11.5
chardet==3.0.4
cmarkgfm==0.4.2
colour==0.1.5
decorator==4.3.0
docopt==0.6.2
docutils==0.14
future==0.16.0
httptools==0.0.11
idna==2.7
ipython-genutils==0.2.0
Jinja2==2.10
jmespath==0.9.3
jsonschema==2.6.0
jupyter-core==4.4.0
libsass==0.15.1
Markdown==3.0.1
MarkupSafe==1.0
more-itertools==4.3.0
multidict==4.4.2
nbformat==4.4.0
numpy==1.15.2
pandas==0.23.4
pathtools==0.1.2
pkginfo==1.4.2
plotly==3.3.0
pluggy==0.8.0
py==1.7.0
pycparser==2.19
Pygments==2.2.0
pytest==3.9.1
python-dateutil==2.7.3
pytz==2018.5
PyYAML==3.13
readme-renderer==22.0
requests==2.20.0
requests-toolbelt==0.8.0
retrying==1.3.3
s3transfer==0.1.13
sanic==0.8.3
six==1.11.0
squarify==0.3.0
tqdm==4.27.0
traitlets==4.3.2
twine==1.12.1
ujson==1.35
urllib3==1.23
uvloop==0.11.2
watchdog==0.9.0
webencodings==0.5.1
websockets==5.0.1
-e git+https://github.com/cliqz-oss/whotracks.me.git@ecc99318a7323f4eb0c765c2412ddabdf3e2f633#egg=whotracksme
yarl==1.2.6

Filtering buttons for 'Presence on top sites' do not work properly

  1. Visit a page of any tracker
  2. In the 'Presence on top sites' section, click one of the buttons to filter the trackers by category, e.g. 'Adult'
  3. Click on the same button again

expected result: the filter for this category should be removed
actual result: nothing happens

RSS?

I want to keep up with the blog but I don't want to add it on any social media, RSS would be great. Could you add RSS feed? Thank you.

Correct classification of contentpass

I'm the CTO and co-founder of contentpass.

I believe that the current record about our service at https://whotracks.me/trackers/contentpass.html misrepresents what contentpass is actually doing and I would love to see the record corrected.

contentpass offers cross-publisher subscriptions as well as consent management for news publishers. Publishers can make use of our product by including a JavaScript library which is hosted on our domain contentpass.net

Part of our solution is a statistics endpoint that helps publishers measure the usage of our product. Our JavaScript library sends measurement-signals to the stats endpoint on sites where our solution is being used, the endpoint is located at: https://api.contentpass.net/stats

We believe our solution is especially interesting for privacy-aware users since, among other things, it allows users to support publishers by paying a monthly subscription fee in exchange for removal of banner ads and tracking on the participating publisher sites. We're currently in the rollout process of our service: while the cross-publisher subscriptions are not available to the public yet, we're already integrating with many publishers and we're planning public availability in the coming months. This is why our domains already appear in the whotracks.me database.

We have designed our service based on the principles of privacy-by-design and privacy-by-default. Among other things, we take the following measures to protect the privacy of users visiting publisher websites where our service is implemented:

  • we do not store any IP addresses (not even truncated IP addresses).
  • we do not collect any personally identifiable information (PII).
  • we do not collect any sort of unique user identifiers (uid) that would allow reconstruction of a browsing session.
  • we do not collect device information which would allow device fingerprinting (i.e. no screen resolution, no information about installed plugins, etc.).
  • we do not perform any cross-domain and/or 3rd-party tracking.

By the way, many of our design decisions were influenced by the "Data Collection without Privacy Side-Effects" Paper.

We're also adhering to the EFF Do Not Track (DNT) Policy as we've recently announced on our blog.

We believe that transparency is important, and we therefore value your efforts with the whotracks.me database. However, we also think that the information shown there should be correct and not misleading.

We would therefore like to ask for our information to be corrected:

  1. The record currently claims that we employ fingerprinting. This is not correct, we do not perform any fingerprinting, have never done so in the past and will never do it in the future. In fact, we do not even do anything that would qualify as "tracking" in the terms defined in the What is a tracker? blog post.
  2. We're currently categorized as "Advertising", however we're doing quite the opposite: We're offering subscriptions for ad-free and premium access to publisher websites, as well as consent management. We think that the category "Essential" would describe much better what we're doing.

If you need any additional information I'm happy to provide it here. I also hope that opening an issue was the correct way of addressing this (since #51 is still open).

You have been added to awesome-humane-tech

This is just an FYI issue to notify you that you were added to the curated awesome-humane-tech list in the 'Awareness' category and, if you like, are now entitled to wear our badge:

[badge: Awesome Humane Tech]

By adding this to the README:

[![Awesome Humane Tech](https://raw.githubusercontent.com/humanetech-community/awesome-humane-tech/main/humane-tech-badge.svg?sanitize=true)](https://github.com/humanetech-community/awesome-humane-tech)

https://github.com/humanetech-community/awesome-humane-tech

Expand tracker database fields

The informational fields on trackers should be expanded to provide richer information on each entity. Fields to be added:

  • Operating country
  • Privacy contact / Data protection officer
  • Description/in their own words: A short description of what the tracker/company does which can be displayed on the website.
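
For illustration, a tracker record with the proposed additions might look like the sketch below (field names are hypothetical, not the database's actual schema):

# hypothetical expanded tracker record -- field names illustrative only
tracker = {
    'id': 'example_tracker',
    'name': 'Example Tracker',
    'category': 'advertising',
    # proposed additions:
    'operating_country': 'DE',
    'privacy_contact': 'dpo@example.com',
    'description': 'In their own words: a short description of what the '
                   'tracker/company does, suitable for display on the website.',
}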

Missing dependency on requests

Was just doing a clean setup before making a PR, and I noticed that requests should be in the main requirements.txt, not requirements-dev.txt.

To reproduce:

# Make env
$ pip install whotracksme
$ python

Python 3.6.6 | packaged by conda-forge | (default, Oct 12 2018, 14:08:43)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from whotracksme.data.loader import DataSource             
Traceback (most recent call last):                                                                                                               
  File "<stdin>", line 1, in <module>        
  File "/home/bird/Dev/birdsarah/whotracks.me/whotracksme/data/__init__.py", line 2, in <module>
    from whotracksme.data.loader import (         
  File "/home/bird/Dev/birdsarah/whotracks.me/whotracksme/data/loader.py", line 7, in <module>
    from whotracksme.data.db import load_tracker_db, create_tracker_map
  File "/home/bird/Dev/birdsarah/whotracks.me/whotracksme/data/db.py", line 2, in <module>
    import requests                                  
ModuleNotFoundError: No module named 'requests'   

Referrer leak

When visiting a tracker's site from whotracks.me, the browser sends the referrer URL.
[screenshot: visiting twitter.com from whotracks.me]

The link has a rel="noreferrer" attribute, but the implementation looks broken in FF.

Can you try adding a referrer policy in <meta> tags, like <meta name="referrer" content="same-origin">?

According to https://bugzilla.mozilla.org/show_bug.cgi?id=530396, it should be honoured, but I am opening a separate bug with FF now.

Canvas fingerprinting warning

When whotracks.me is opened with privacy.resistFingerprinting or in Tor browser (with JS allowed), it throws a warning related to canvas fingerprinting.

[screenshots: canvas fingerprinting warnings in Cliqz and in the Tor Browser]

Pagination on websites page

Now that the site covers 3,500 websites, the websites listing page has grown to 3.5MB of HTML, making it very heavy to load. We should paginate this listing to make loading faster; a chunking sketch is below.
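
A minimal sketch of the chunking step, assuming the site generator has the full list of website entries in memory (names hypothetical):

# split the websites listing into fixed-size pages, each of which
# would be rendered as its own HTML file
def paginate(items, per_page=100):
    for page_number, start in enumerate(range(0, len(items), per_page), start=1):
        yield page_number, items[start:start + per_page]

# e.g. 3,500 entries -> 35 pages of 100
pages = list(paginate(list(range(3500))))
print(len(pages), len(pages[0][1]))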

Website categories

Hi guys, after working extensively with the Global and EU/US sites.csv datasets, I noticed wrongly categorized websites. This could be valuable to someone working with panel data from 2017 onward where website categories are important. I note the problematic websites and proposed recoded categories below for both datasets (a recoding sketch follows the lists).

Global sites' categories

  • nih.gov - categorized as Reference site until September 2018 release when it was properly categorized as Government; should be Government from the start.
  • targobank.de - categorized as Business site; should be Banking site.
  • ddl-warez.to - categorized as Recreation site; should be Entertainment site.

EU/US sites' categories

  • ca.gov - categorized as Reference site until September 2018 release when it was properly categorized as Government; should be Government from the start.
  • europa.eu - same as above.
  • nasa.gov - same as above.
  • nih.gov - same as above.
  • state.gov - same as above.
  • gov.uk - same as above.
  • irs.gov - categorized as Business site until September 2018 release when it was properly categorized as Government; should be Government from the start.
  • tax.service.gov.uk - same as above.
  • weather.gov - categorized as News and Portals site until September 2018 release when it was properly categorized as Government; should be Government from the start.
  • targobank.de - categorized as Business site; should be Banking site.
  • audible.com categorized as Entertainment; should be E-commerce.
  • discover.com categorized as News and Portals; should be Banking site.
  • ddl-warez.to - categorized as Recreation site; should be Entertainment site.
  • yelp.de categorized as Entertainment; should be Reference.
  • stylight.de categorized as Reference; should be E-commerce.
  • linguee.de categorized as News and Portals; should be Reference.
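
A minimal sketch of applying the recodings to a loaded sites table, assuming pandas and that the frame has 'site' and 'category' columns (column names hypothetical):

import pandas as pd

# proposed recodings from the lists above (extend as needed)
recodings = {
    'nih.gov': 'Government',
    'targobank.de': 'Banking',
    'ddl-warez.to': 'Entertainment',
    'audible.com': 'E-commerce',
    'discover.com': 'Banking',
    'yelp.de': 'Reference',
    'stylight.de': 'E-commerce',
    'linguee.de': 'Reference',
}

sites = pd.read_csv('sites.csv')  # path illustrative
sites['category'] = sites['site'].map(recodings).fillna(sites['category'])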

Update .sql data

How come the SQL database is still from July 2020 data when the repo has data up to Dec 2020?

Data set not available?

Hey there, I was looking for the whotracks.me data set for research at my university, but instead of the data I get this text:

version https://git-lfs.github.com/spec/v1
oid sha256:42f43921beed49e843db30eeeec308209e71669c4a3c650f101fc35596063519
size 219147

On Wednesday it was still available... Could you help me with that issue?

Thanks and greetings
Lena
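
For reference, that text is a Git LFS pointer file rather than the CSV itself; the real content has to be fetched via Git LFS. A minimal sketch (assuming a local checkout) for spotting files that are still pointers:

from pathlib import Path

# a Git LFS pointer file starts with this marker instead of CSV data
LFS_MARKER = b'version https://git-lfs.github.com/spec/'

def is_lfs_pointer(path: Path) -> bool:
    with open(path, 'rb') as f:
        return f.read(len(LFS_MARKER)) == LFS_MARKER

for f in Path('whotracksme/data/assets').rglob('*.csv'):
    if is_lfs_pointer(f):
        print(f)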

Inconsistency in Company Name field

While looking at assets/trackerdb.sql, I noticed that in the trackers table the company names sometimes appear with quotes, like "company name", and sometimes without. Is there a reason for this?

name
"Google Analytics"
DoubleClick
Google
"Google APIs"
"Google Tag Manager"
Facebook
INFOnline
"Google AdServices"
"Google Syndication"
"Amazon Web Services"

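Incidentally, the quoted names in the list above all contain spaces while the unquoted ones do not, which suggests CSV-style quoting applied when the dump was generated. A minimal normalising sketch, hedged since the dump format hasn't been confirmed:

def normalize_name(raw: str) -> str:
    # strip one pair of surrounding double quotes, if present
    if len(raw) >= 2 and raw[0] == raw[-1] == '"':
        return raw[1:-1]
    return raw

assert normalize_name('"Google Analytics"') == 'Google Analytics'
assert normalize_name('DoubleClick') == 'DoubleClick'
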
Duplicate E-commerce labels

[screenshot: duplicated E-commerce label, 2019-07-18]

When you visit a tracker detail page like https://whotracks.me/trackers/yieldlove.html#Entertainment, there is a section that displays the categories of websites the tracker appears on.

The label E-Commerce is duplicated.

Clarification on how to get a list of domains associated with fingerprinting

I've been working through the various data options trying to compile a list of domains that classify as fingerprinting. I'm getting mixed results and am wondering if you can clarify what you think of as the canonical approach.

Apologies if I'm just misreading the documentation. I'm happy to submit a PR to docs if you think it would be useful.

I can use the data source and get a list of tracker IDs as follows:

from whotracksme.data.loader import DataSource

# collect ids of trackers whose fingerprinting signal (bad_qs) exceeds
# 10% in any region
fp_trackers = set()
regions = {'de', 'eu', 'fr', 'global', 'us'}
for region in regions:
    who_tracks_data = DataSource(region=region)
    who_tracks_fp = who_tracks_data.trackers.df[who_tracks_data.trackers.df.bad_qs > 0.1]
    fp_trackers.update(list(who_tracks_fp.tracker.values))

This gives me 193 trackers. I can then map this to domains using the map from create_tracker_map.

# tracker_info is the map produced by create_tracker_map
# (whotracksme.data.db), keyed by tracker id
could_not_find = []
domains = set()
for tracker in fp_trackers:
    try:
        domains.update(tracker_info['trackers'][tracker]['domains'])
    except KeyError:
        could_not_find.append(tracker)

This will give me 326 domains.

If I take a different route and read in all the csv files named domains.csv under the assets folders, I can get a list of domains like this:

import pandas as pd

domains_df = pd.concat([
    pd.read_csv(file, parse_dates=['month'])
    for file in asset_paths['domains']  # I have previously assembled all the paths
])
fingerprinting_trackers = domains_df[domains_df.bad_qs > 0.1].host_tld.unique()

But this gives me a list of 292 domains.

I can think of an explanation for this: not every host_tld need have a bad_qs that meets the threshold itself; some may have been added to the tracker map for other reasons. The discrepancy can be inspected directly, as sketched below.
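
A minimal sketch, reusing the domains and fingerprinting_trackers names from the snippets above:

# domains that appear in only one of the two lists
map_only = domains - set(fingerprinting_trackers)
csv_only = set(fingerprinting_trackers) - domains
print(f'{len(map_only)} only in the tracker map, {len(csv_only)} only in domains.csv')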

However, given that the other csv files may also be relevant, I was starting to lose confidence and so wanted to check in.

Many thanks in advance for your help.

KeyError

After:

from whotracksme.data.loader import DataSource
data = DataSource()

I get:

data available for months:
├── 2017-05
├── 2017-06
├── 2017-07
├── 2017-08
├── 2017-09
├── 2017-10
├── 2017-11
├── 2017-12
├── 2018-01
├── 2018-02
├── 2018-03
├── 2018-04
├── 2018-05
├── 2018-06
├── 2018-07
├── 2018-08
├── 2018-09
├── 2018-10
├── 2018-11
├── 2018-12
├── 2019-01
├── 2019-02
├── 2019-03
├── 2019-04
├── 2019-05
├── 2019-06
├── 2019-07
├── 2019-08
├── 2019-09
├── 2019-10
├── 2019-11
├── 2019-12
load trackers
update/create data for 2017-05/global/trackers.csv
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kato/Applications/whotracks.me/whotracksme/data/loader.py", line 60, in __init__
    populate=populate,
  File "/home/kato/Applications/whotracks.me/whotracksme/data/loader.py", line 189, in __init__
    self.db.load_data('trackers', self.region, month)
  File "/home/kato/Applications/whotracks.me/whotracksme/data/db.py", line 312, in load_data
    [row[col] for col in name_columns] + \
KeyError: 'month'
