Introduction

There are several ways to access OONI data: OONI Explorer, the OONI API, or ClickHouse table dumps.

The OONI API is meant for developers and researchers. It allows searching for measurement metadata, fetching single measurements, and generating statistics.

However, the OONI API is not designed for large data transfers (for example, extracting tens of thousands of measurements or many GB of data) and enforces rate limiting. If you are interested in a dump of the ClickHouse tables, please reach out to us instead of scraping our API.

Researchers can access the raw measurement data from an S3 bucket. The specifications of the OONI data formats can be found in ooni/spec.

Accessing raw measurement data

"Raw measurement data" refers to data structures uploaded by OONI Probes (run by volunteers worldwide) to the processing pipeline.

Thanks to the Amazon Open Data program, the whole OONI dataset can be fetched from the ooni-data-eu-fra Amazon S3 bucket.

A single chunk of data is called "a measurement". Its uncompressed size varies roughly between 1 KB and 1 MB.

Probes usually upload multiple measurements on each execution. Measurements are stored temporarily and then batched together, compressed and uploaded to the S3 bucket once every hour. To ensure transparency, incoming measurements go through basic content validation and the API returns success or error; once a measurement is accepted it will be published on S3.

OONI measurements are also processed by the fastpath and made immediately available on OONI Explorer. See the "receive_measurement" function in the probe_services.py file in the API codebase for details.

The commands that follow use the AWS CLI (aws s3). See the AWS documentation for installation instructions.

Since OONI data is part of the AWS Open Data program, you don't have to pay for access, and the --no-sign-request flag lets you access it without AWS credentials.

File paths in the S3 bucket in JSONL format

Each file contains one JSON document per measurement, separated by newlines and gzip-compressed for easy processing. The path structure makes it easy to select, identify, and download data based on the researcher's needs.
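As a minimal sketch of how such a file can be inspected once downloaded, the functions below count the measurements in a file and print the first one. The function names are illustrative and not part of any OONI tooling. The only assumption is the layout described above: gzip compression with one newline-terminated JSON document per line.

```shell
# Count the measurements in a .jsonl.gz file.
# Assumes one JSON document per line, each terminated by a newline.
count_measurements() {
    gzip -dc "$1" | wc -l | tr -d ' '
}

# Print the first measurement of a .jsonl.gz file as raw JSON.
first_measurement() {
    gzip -dc "$1" | head -n 1
}
```

For example, count_measurements 2021081715_US_webconnectivity.n0.0.jsonl.gz prints the number of measurements in that file.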

In the path template:

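  • cc is an uppercase two-letter country code
  • testname is the test name with underscores removed
  • timestamp is a date in YYYYMMDD format
  • hour is a two-digit hour (for example, 15)
  • ts2 is the timestamp followed by the hour (for example, 2021081715)
  • name is a unique filename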
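To illustrate the naming rules in the list above, here is a small sketch that converts a test name and a country code into the form used in the bucket paths. The helper names are hypothetical, and web_connectivity is used only as an example input.

```shell
# Remove underscores from a test name, as required by the path template.
ooni_path_testname() {
    printf '%s\n' "$1" | tr -d '_'
}

# Uppercase a two-letter country code, as required by the path template.
ooni_path_cc() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}
```

With these helpers, ooni_path_testname web_connectivity yields webconnectivity and ooni_path_cc it yields IT.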

Compressed JSONL from measurements starting from 2020-10-20

The path structure is: s3://ooni-data-eu-fra/raw/<timestamp>/<hour>/<cc>/<testname>/<ts2>_<cc>_<testname>.<host_id>.<counter>.jsonl.gz

Example: s3://ooni-data-eu-fra/raw/20210817/15/US/webconnectivity/2021081715_US_webconnectivity.n0.0.jsonl.gz
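When working with many downloaded files, it can help to recover the path fields from a file name. The following is a sketch using POSIX parameter expansion. The function name is hypothetical, and it assumes that the test name contains no underscores (as stated above) and that the host id contains no dots.

```shell
# Split a JSONL file name such as
#   2021081715_US_webconnectivity.n0.0.jsonl.gz
# into its fields and print them separated by spaces:
#   <ts2> <cc> <testname> <host_id> <counter>
# Note: the variables below are global, as POSIX sh has no "local".
parse_ooni_jsonl_name() {
    name=${1##*/}            # drop any directory or bucket prefix
    stem=${name%.jsonl.gz}   # drop the extension
    ts2=${stem%%_*}          # text before the first underscore
    rest=${stem#*_}
    cc=${rest%%_*}           # text before the second underscore
    rest=${rest#*_}
    testname=${rest%%.*}     # text before the first dot
    rest=${rest#*.}
    host_id=${rest%%.*}
    counter=${rest#*.}
    printf '%s %s %s %s %s\n' "$ts2" "$cc" "$testname" "$host_id" "$counter"
}
```

Applied to the example path above, it prints 2021081715 US webconnectivity n0 0.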

Note: The path will be updated in the future to live under /jsonl/

Listing JSONL files:

aws s3 --no-sign-request ls \
    s3://ooni-data-eu-fra/raw/20210817/15/US/webconnectivity/
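The listing command above can be parameterized. This is a sketch with hypothetical function names: it builds the prefix from the path structure given earlier and passes it to aws s3 ls.

```shell
# Build the S3 prefix for one hour, country and test.
# Arguments: <YYYYMMDD> <HH> <CC> <testname without underscores>
ooni_raw_prefix() {
    printf 's3://ooni-data-eu-fra/raw/%s/%s/%s/%s/\n' "$1" "$2" "$3" "$4"
}

# List the files under that prefix without signing the request.
ooni_ls_hour() {
    aws s3 --no-sign-request ls "$(ooni_raw_prefix "$@")"
}
```

Calling ooni_ls_hour 20210817 15 US webconnectivity runs the same listing as the command above.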

Downloading entire dates

If you would like to download the raw measurements for a particular country, you can use the aws s3 sync command.

For example, to download all JSONL measurements from Italy on the 1st of February 2024, you can run:

aws s3 --no-sign-request sync \
    s3://ooni-data-eu-fra/raw/20240201/ ./ \
    --exclude "*" --include "*/IT/*.jsonl.gz"
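The same approach can be narrowed to a single test. Below is a sketch of a wrapper (the function name is hypothetical) that restricts the include pattern to one country and one test name. Extra arguments are passed through to aws s3 sync, so the --dryrun option can be used to preview which files would be downloaded.

```shell
# Download one day of JSONL files for one country and one test.
# Arguments: <YYYYMMDD> <CC> <testname without underscores> <destination dir>
# Any further arguments (for example --dryrun) are passed to "aws s3 sync".
ooni_sync_day() {
    day=$1; cc=$2; testname=$3; dest=$4
    shift 4
    aws s3 --no-sign-request sync \
        "s3://ooni-data-eu-fra/raw/${day}/" "${dest}" \
        --exclude "*" \
        --include "*/${cc}/${testname}/*.jsonl.gz" \
        "$@"
}
```

For instance, ooni_sync_day 20240201 IT webconnectivity ./data --dryrun only prints the planned transfers. Because filters are evaluated in order, the exclude pattern has to come before the include pattern.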

Note the difference in paths compared to older data.

Compressed JSONL from measurements before 2020-10-21

The path structure is: s3://ooni-data-eu-fra/jsonl/<testname>/<cc>/<timestamp>/00/<name>.jsonl.gz

Example: s3://ooni-data-eu-fra/jsonl/webconnectivity/IT/20200921/00/20200921_IT_webconnectivity.l.0.jsonl.gz
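In this older layout the test name comes first and the hour component is always 00 in the path structure shown above. A sketch of a prefix builder for it follows. The function name is hypothetical.

```shell
# Build the S3 prefix for one test, country and day in the older layout.
# Arguments: <testname without underscores> <CC> <YYYYMMDD>
ooni_legacy_prefix() {
    printf 's3://ooni-data-eu-fra/jsonl/%s/%s/%s/00/\n' "$1" "$2" "$3"
}
```

For the example above, ooni_legacy_prefix webconnectivity IT 20200921 prints the directory that contains the file.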

Listing JSONL files:

aws s3 --no-sign-request ls s3://ooni-data-eu-fra/jsonl/
aws s3 --no-sign-request ls \
    s3://ooni-data-eu-fra/jsonl/webconnectivity/US/20201021/00/

Downloading entire dates

If you would like to download the raw measurements for a particular country, you can use the aws s3 sync command.

For example, to download webconnectivity measurements from Italy on the 1st of February 2020, you can run:

aws s3 --no-sign-request sync \
    s3://ooni-data-eu-fra/jsonl/webconnectivity/IT/20200201/ ./

Note the difference in paths compared to newer data.

Raw "postcans" from measurements starting from 2020-10-20

A "postcan" is a tarball containing measurements as they are uploaded by the probes, optionally compressed. Each HTTP POST is stored in the tarball as <timestamp>_<cc>_<testname>/<timestamp>_<cc>_<testname>_<hash>.post

Example: s3://ooni-data-eu-fra/raw/20210817/11/GB/webconnectivity/2021081711_GB_webconnectivity.n0.0.tar.gz

Listing postcan files:

aws s3 --no-sign-request ls s3://ooni-data-eu-fra/raw/20210817/
aws s3 --no-sign-request ls \
    s3://ooni-data-eu-fra/raw/20210817/11/GB/webconnectivity/
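Once a postcan has been downloaded, its entries can be inspected with tar. The sketch below uses hypothetical function names and assumes a gzip-compressed tarball (.tar.gz) like the one in the example. It makes no assumption about the content of each .post entry.

```shell
# List the .post entries stored in a gzip-compressed postcan.
list_postcan_entries() {
    tar -tzf "$1" | grep '\.post$'
}

# Write a single entry to standard output without unpacking the tarball.
# Arguments: <postcan file> <entry path as printed by list_postcan_entries>
extract_postcan_entry() {
    tar -xzOf "$1" "$2"
}
```

For example, list_postcan_entries 2021081711_GB_webconnectivity.n0.0.tar.gz prints one line per stored HTTP POST.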

Contributors

hellais, decfox, bassosimone