prio-processor's Introduction

prio-processor

Prio is a system for aggregating data in a privacy-preserving way. This repository includes a command-line tool for batch processing in Prio's multi-server architecture.

For more information about Prio, see this blog post.

Docker

This project contains a pre-configured build and test environment via Docker.

make

# or run directly through docker-compose
docker-compose build

You can mount your working directory and shell into the container for development work.

docker-compose run -v $PWD:/app prio_processor bash

Adding new dependencies

To add new Python dependencies to the container, use pip-tools to manage the requirements.txt.

pip install pip-tools

# generate the installation requirements from setup.py
pip-compile

# generate dev requirements
pip-compile requirements-dev.in

Any new system dependencies should be added to the Dockerfile at the root of the repository. These will be available at runtime.

Deployment Configuration

See the deployment directory for examples of configuration that can be used to aid deployment. These may also be run as integration tests to determine whether resources are configured properly. These will typically assume Google Cloud Platform (GCP) as a resource provider.

See the guide for more details.

prio-processor's Issues

Fix dockerhub deploys

#79 separated building the dev image from the prod image. However, this makes deploys fail:

#!/bin/sh -eo pipefail
docker build --target production -t prio:prod .

unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /root/project/Dockerfile: no such file or directory

Exited with code exit status 1

The final workflow stage does not have the repository checked out.

`bin/process` fails when dataset for batch_id does not exist

In our nonprod job:

[2020-09-25 03:28:23,142] {logging_mixin.py:112} INFO - [2020-09-25 03:28:23,142] {pod_launcher.py:125} INFO - b'+ spark-submit --name origin-telemetry --py-files /tmp/bootstrap-Fdu/prio_processor.egg /tmp/bootstrap-Fdu/processor-spark.py verify1 --batch-id content.blocking_opener_after_user_interaction_exempt-0 --n-data 2046 --input gs://moz-fx-prio-dev-a-private/data/v1/F992B575840AEC202289FBF99D6C04FB2A37B1DA1CDEB1DF8036E1340D46C561/origin-telemetry/2020-09-24/raw/shares/batch_id=content.blocking_opener_after_user_interaction_exempt-0 --output gs://moz-fx-prio-dev-a-shared/data/v1/F992B575840AEC202289FBF99D6C04FB2A37B1DA1CDEB1DF8036E1340D46C561/origin-telemetry/2020-09-24/intermediate/internal/verify1/batch_id=content.blocking_opener_after_user_interaction_exempt-0\n'
[2020-09-25 03:28:26,361] {logging_mixin.py:112} INFO - [2020-09-25 03:28:26,361] {pod_launcher.py:125} INFO - b'20/09/25 03:28:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n'
[2020-09-25 03:28:28,616] {logging_mixin.py:112} INFO - [2020-09-25 03:28:28,616] {pod_launcher.py:125} INFO - b'Running verify1\n'
[2020-09-25 03:28:36,151] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,151] {pod_launcher.py:125} INFO - b'Traceback (most recent call last):\n'
[2020-09-25 03:28:36,152] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,152] {pod_launcher.py:125} INFO - b'  File "/tmp/bootstrap-Fdu/processor-spark.py", line 2, in <module>\n'
[2020-09-25 03:28:36,153] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,152] {pod_launcher.py:125} INFO - b'    commands.entry_point()\n'
[2020-09-25 03:28:36,153] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,153] {pod_launcher.py:125} INFO - b'  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 829, in __call__\n'
[2020-09-25 03:28:36,154] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,154] {pod_launcher.py:125} INFO - b'    return self.main(*args, **kwargs)\n'
[2020-09-25 03:28:36,154] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,154] {pod_launcher.py:125} INFO - b'  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 782, in main\n'
[2020-09-25 03:28:36,155] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,155] {pod_launcher.py:125} INFO - b'    rv = self.invoke(ctx)\n'
[2020-09-25 03:28:36,155] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,155] {pod_launcher.py:125} INFO - b'  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 1259, in invoke\n'
[2020-09-25 03:28:36,156] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,156] {pod_launcher.py:125} INFO - b'    return _process_result(sub_ctx.command.invoke(sub_ctx))\n'
[2020-09-25 03:28:36,156] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,156] {pod_launcher.py:125} INFO - b'  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke\n'
[2020-09-25 03:28:36,157] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,156] {pod_launcher.py:125} INFO - b'    return ctx.invoke(self.callback, **ctx.params)\n'
[2020-09-25 03:28:36,157] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,157] {pod_launcher.py:125} INFO - b'  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 610, in invoke\n'
[2020-09-25 03:28:36,158] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,158] {pod_launcher.py:125} INFO - b'    return callback(*args, **kwargs)\n'
[2020-09-25 03:28:36,159] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,158] {pod_launcher.py:125} INFO - b'  File "/tmp/bootstrap-Fdu/prio_processor.egg/prio_processor/spark/commands.py", line 261, in verify1\n'
[2020-09-25 03:28:36,159] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,159] {pod_launcher.py:125} INFO - b'  File "/usr/local/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 300, in json\n'
[2020-09-25 03:28:36,162] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,162] {pod_launcher.py:125} INFO - b'  File "/usr/local/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__\n'
[2020-09-25 03:28:36,163] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,163] {pod_launcher.py:125} INFO - b'  File "/usr/local/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 137, in deco\n'
[2020-09-25 03:28:36,163] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,163] {pod_launcher.py:125} INFO - b'  File "<string>", line 3, in raise_from\n'
[2020-09-25 03:28:36,191] {logging_mixin.py:112} INFO - [2020-09-25 03:28:36,191] {pod_launcher.py:125} INFO - b'pyspark.sql.utils.AnalysisException: Path does not exist: gs://moz-fx-prio-dev-a-private/data/v1/F992B575840AEC202289FBF99D6C04FB2A37B1DA1CDEB1DF8036E1340D46C561/origin-telemetry/2020-09-24/raw/shares/batch_id=content.blocking_opener_after_user_interaction_exempt-0;\n'

Create a standard benchmarking dataset

There should be a standard benchmarking dataset which can be used to stress a reasonable deployment of the servers.

10^6 data points with 2000 elements should result in a 50GB dataset with 200 partitions at 250MB each. A reasonable benchmark would be roughly a tenth of this size at 20 partitions or 5GB.

This supplements the existing end-to-end testing data, e.g. the data produced by processor/bin/generate.
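
A minimal sketch of generating such a dataset with PySpark is below. The schema, payload size, and output path are assumptions for illustration only; processor/bin/generate remains the reference for real test data.

# Sketch: synthesize a ~5 GB benchmark dataset (a tenth of the full size),
# assuming roughly 50 KB per row and 20 partitions of ~250 MB each.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

n_rows = 10 ** 5            # a tenth of the 10^6 rows described above
n_partitions = 20
payload = "x" * 50_000      # stand-in for an encoded share of ~50 KB

df = (
    spark.range(n_rows)
    .withColumn("batch_id", F.lit("benchmark-0"))
    .withColumn("payload", F.lit(payload))
    .repartition(n_partitions)
)
df.write.mode("overwrite").json("benchmark/")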

Add `BUCKET_INTERNAL_INGESTION` for initializing data

Currently, the admin server places data into BUCKET_INTERNAL_PRIVATE, which is also used for intermediate values. This can give the administrator (the server that handles ingesting and partitioning data) overly broad access to each of the server buckets. In the Origin Telemetry case, server A and the admin server are both run in Airflow and are accessible to the same set of individuals.

Set of valid batch ids should be filtered in staging

from pyspark.sql import Window
from pyspark.sql.functions import col, explode, lit, row_number

df = (
    data
    # NOTE: is it worth counting duplicates?
    .dropDuplicates(["id"])
    .withColumn("data", explode("prioData"))
    # drop the id and assign a new one per encoding type
    # this id is used as a join-key during the decoding process
    .select(col("data.encoding").alias("batch_id"), "data.prio")
    .withColumn(
        "id", row_number().over(Window.partitionBy("batch_id").orderBy(lit(0)))
    )
)

The partitioning job should limit the set of valid batch ids to those defined in content.json (a sketch follows). Otherwise downstream stages fail when they look up parameters for an unknown batch id, as in the log output below where --n-data resolves to null.
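
A minimal sketch of the filter, assuming content.json maps batch ids to their parameters (the path and structure here are illustrative):

import json
from pyspark.sql.functions import col

# Load the allow-list of batch ids from the shared configuration.
with open("config/content.json") as f:
    valid_batch_ids = list(json.load(f).keys())

# Keep only rows whose batch_id is declared in content.json.
df = df.where(col("batch_id").isin(valid_batch_ids))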

[2019-08-07 23:44:33,104] {logging_mixin.py:95} INFO - [2019-08-07 23:44:33,103] {pod_launcher.py:104} INFO - + parallel process ::: 'submission_date=2019-08-05/batch_id=content.blocking_blocked_TESTONLY-0/part-00000-389a770c-9507-4c6e-a669-160366928daf.c000.json
[2019-08-07 23:44:33,105] {logging_mixin.py:95} INFO - [2019-08-07 23:44:33,105] {pod_launcher.py:104} INFO - submission_date=2019-08-05/batch_id=content.blocking_blocked-1/part-00000-389a770c-9507-4c6e-a669-160366928daf.c000.json
[2019-08-07 23:44:33,105] {logging_mixin.py:95} INFO - [2019-08-07 23:44:33,105] {pod_launcher.py:104} INFO - submission_date=2019-08-05/batch_id=content.blocking_blocked-0/part-00000-389a770c-9507-4c6e-a669-160366928daf.c000.json
[2019-08-07 23:44:33,106] {logging_mixin.py:95} INFO - [2019-08-07 23:44:33,106] {pod_launcher.py:104} INFO - submission_date=2019-08-05/batch_id=content.blocking_blocked_TESTONLY-1/part-00000-389a770c-9507-4c6e-a669-160366928daf.c000.json'
[2019-08-07 23:44:42,679] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,678] {pod_launcher.py:104} INFO - Running verify1
[2019-08-07 23:44:42,681] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,681] {pod_launcher.py:104} INFO - Traceback (most recent call last):
[2019-08-07 23:44:42,684] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,683] {pod_launcher.py:104} INFO -   File "/usr/local/bin/prio", line 11, in <module>
[2019-08-07 23:44:42,688] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,687] {pod_launcher.py:104} INFO -     sys.exit(main())
[2019-08-07 23:44:42,690] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,689] {pod_launcher.py:104} INFO -   File "/usr/local/lib64/python3.6/site-packages/click/core.py", line 764, in __call__
[2019-08-07 23:44:42,691] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,691] {pod_launcher.py:104} INFO -     return self.main(*args, **kwargs)
[2019-08-07 23:44:42,693] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,692] {pod_launcher.py:104} INFO -   File "/usr/local/lib64/python3.6/site-packages/click/core.py", line 717, in main
[2019-08-07 23:44:42,695] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,694] {pod_launcher.py:104} INFO -     rv = self.invoke(ctx)
[2019-08-07 23:44:42,697] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,696] {pod_launcher.py:104} INFO -   File "/usr/local/lib64/python3.6/site-packages/click/core.py", line 1137, in invoke
[2019-08-07 23:44:42,699] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,698] {pod_launcher.py:104} INFO -     return _process_result(sub_ctx.command.invoke(sub_ctx))
[2019-08-07 23:44:42,700] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,700] {pod_launcher.py:104} INFO -   File "/usr/local/lib64/python3.6/site-packages/click/core.py", line 956, in invoke
[2019-08-07 23:44:42,701] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,701] {pod_launcher.py:104} INFO -     return ctx.invoke(self.callback, **ctx.params)
[2019-08-07 23:44:42,704] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,703] {pod_launcher.py:104} INFO -   File "/usr/local/lib64/python3.6/site-packages/click/core.py", line 555, in invoke
[2019-08-07 23:44:42,707] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,706] {pod_launcher.py:104} INFO -     return callback(*args, **kwargs)
[2019-08-07 23:44:42,709] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,709] {pod_launcher.py:104} INFO -   File "/usr/local/lib64/python3.6/site-packages/prio/cli/commands.py", line 134, in verify1
[2019-08-07 23:44:42,711] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,711] {pod_launcher.py:104} INFO -     libprio.PrioVerifier_set_data(verifier, share)
[2019-08-07 23:44:42,714] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,713] {pod_launcher.py:104} INFO -   File "/usr/local/lib64/python3.6/site-packages/prio/libprio.py", line 319, in PrioVerifier_set_data
[2019-08-07 23:44:42,716] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,715] {pod_launcher.py:104} INFO -     return _libprio.PrioVerifier_set_data(*args)
[2019-08-07 23:44:42,717] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,717] {pod_launcher.py:104} INFO - RuntimeError: PrioVerifier_set_data was not successful.
[2019-08-07 23:44:42,846] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,845] {pod_launcher.py:104} INFO - Usage: prio verify1 [OPTIONS]
[2019-08-07 23:44:42,847] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,846] {pod_launcher.py:104} INFO - Try "prio verify1 --help" for help.
[2019-08-07 23:44:42,848] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,848] {pod_launcher.py:104} INFO -
[2019-08-07 23:44:42,849] {logging_mixin.py:95} INFO - [2019-08-07 23:44:42,849] {pod_launcher.py:104} INFO - Error: Invalid value for "--n-data": null is not a valid integer
[2019-08-07 23:44:43,062] {logging_mixin.py:95} INFO - [2019-08-07 23:44:43,062] {pod_launcher.py:104} INFO - Usage: prio verify1 [OPTIONS]
[2019-08-07 23:44:43,063] {logging_mixin.py:95} INFO - [2019-08-07 23:44:43,063] {pod_launcher.py:104} INFO - Try "prio verify1 --help" for help.
[2019-08-07 23:44:43,064] {logging_mixin.py:95} INFO - [2019-08-07 23:44:43,064] {pod_launcher.py:104} INFO -
[2019-08-07 23:44:43,065] {logging_mixin.py:95} INFO - [2019-08-07 23:44:43,065] {pod_launcher.py:104} INFO - Error: Invalid value for "--n-data": null is not a valid integer

CODE_OF_CONDUCT.md file missing

As of January 1, 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:

  1. Required Text - All text under the headings Community Participation Guidelines and How to Report is required and should not be altered.
  2. Optional Text - The Project Specific Etiquette heading provides a space to speak more specifically about ways people can work effectively and inclusively together. Some examples can be found in the Firefox Debugger project and Common Voice. (The optional part is commented out in the raw template file and will not be visible until you modify and uncomment it.)

If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].

(Message COC001)

Set up key management using Cloud KMS

The production keys are currently stored in sops, per bug 1552315. The container should expect to pull the private keys from a bucket that has been populated with them. These keys are copied into the container and decrypted into exported environment variables. Cloud KMS provides audit logs that can be monitored for proper usage.
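
A minimal sketch of the decryption step using the google-cloud-kms Python client (an assumption — the issue only references the gcloud CLI; the key path, file name, and environment variable are hypothetical):

import os
from google.cloud import kms

client = kms.KeyManagementServiceClient()
# Hypothetical key resource; substitute the real project, keyring, and key.
key_name = client.crypto_key_path("my-project", "global", "prio", "server-key")

# The encrypted private key is assumed to have been copied from the bucket.
with open("/tmp/server_key.enc", "rb") as f:
    ciphertext = f.read()

response = client.decrypt(request={"name": key_name, "ciphertext": ciphertext})
# Make the decrypted key available to this process and its children.
os.environ["PRIVATE_KEY_HEX"] = response.plaintext.decode("utf-8")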

Remove `Prio` prefix from the SWIG wrapper

Section 5.4.7 of the SWIG documentation gives the syntax for renaming functions. In particular:

%rename("%(strip:[wx])s") ""; // wxHello -> Hello; FooBar -> FooBar

This should probably be done for the Prio-prefixed functions, since they are already implicitly namespaced by the Python module.

Add instructions for setting up AWS S3 configuration

gsutil supports S3 file operations via boto3. There should be documentation around enabling S3 support. This should be as straightforward as passing the appropriate environment variables, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, into the Docker container.

The authentication method should stat the relevant buckets before proceeding.

function authenticate() {
    local cred=${GOOGLE_APPLICATION_CREDENTIALS}
    local test_bucket=${BUCKET_INTERNAL_PRIVATE}
    if [[ -n "${cred}" ]]; then
        gcloud auth activate-service-account --key-file "${cred}"
    else
        # https://cloud.google.com/kubernetes-engine/docs/tutorials/authenticating-to-cloud-platform
        echo "No JSON credentials provided, using default scopes."
    fi
    gsutil ls "gs://${test_bucket}"
}
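
A minimal sketch of the equivalent check for an S3 bucket using boto3 (an assumption — the shell function above uses gsutil against GCS):

import os
import boto3

# Credentials are expected to be passed into the container as environment
# variables, per the note above.
s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# Stat the relevant bucket before proceeding; raises if access is missing.
s3.head_bucket(Bucket=os.environ["BUCKET_INTERNAL_PRIVATE"])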

Missing prio-processor egg for bootstrapping Spark

It looks like the egg used for bootstrapping Spark jobs is no longer being built in the container.

[2020-03-12 23:27:25,907] {logging_mixin.py:112} INFO - [2020-03-12 23:27:25,906] {pod_launcher.py:125} INFO -   File "/app/processor/prio_processor/bootstrap.py", line 42, in run
[2020-03-12 23:27:25,907] {logging_mixin.py:112} INFO - [2020-03-12 23:27:25,907] {pod_launcher.py:125} INFO -     raise RuntimeError("missing bdist_egg artifact")
[2020-03-12 23:27:25,941] {logging_mixin.py:112} INFO - [2020-03-12 23:27:25,940] {pod_launcher.py:125} INFO - RuntimeError: missing bdist_egg artifact

Support generic s3v4 buckets for exchanging data

Currently, the container assumes that it will be run on GCP and relies on authorizing against a service account to connect to the various buckets. For example, the JSON credentials set in GOOGLE_APPLICATION_CREDENTIALS may be granted access from two separate GCP projects.

It's desirable to run the container agnostic to a cloud provider. MinIO provides a self-hosted object store that implements the s3v4 API. The hadoop-aws module allows compatibility with s3v4 APIs, and I've confirmed that I can read data from MinIO using docker-compose.

I'm proposing the following changes to support generic providers.

Use of access and secret keys on buckets

First, for normal operation of the container, GOOGLE_APPLICATION_CREDENTIALS will no longer be necessary. Instead, the following environment variables will need to be set with the HMAC keys for each party, e.g.:

INTERNAL_ACCESS_KEY
INTERNAL_SECRET_KEY
EXTERNAL_ACCESS_KEY
EXTERNAL_SECRET_KEY

These map onto the commonly used AWS environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. HMAC keys can be created for a GCP service account as follows:

gsutil hmac create SERVICE_ACCOUNT_EMAIL

The resulting key pair is then distributed to the co-processor (the partner party).
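
A minimal sketch of selecting a party's keys before a transfer (the helper is hypothetical; the variable names follow the list above):

import os

def use_credentials(party: str) -> None:
    # Point the standard AWS variables at the given party's HMAC keys, for
    # this process and any subprocesses it launches.
    os.environ["AWS_ACCESS_KEY_ID"] = os.environ[f"{party}_ACCESS_KEY"]
    os.environ["AWS_SECRET_ACCESS_KEY"] = os.environ[f"{party}_SECRET_KEY"]

use_credentials("INTERNAL")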

MinIO requires a bit of configuration for server-side encryption, which needs TLS or a KMS of some kind. That is a bit out of scope here; it should suffice to use the default user-supplied key and password, since the shares themselves are already encrypted.

Use of s3a hadoop connector as the default connector

Currently the jobs use the gs connector to connect to buckets. The s3a connector is compatible with anything that implements s3v4. Performance may vary, so the Spark configuration for uploads and downloads will be important to tune, though the stages should be bottlenecked on CPU.
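
A minimal sketch of configuring the s3a connector against a MinIO endpoint (the endpoint and bucket name are assumptions, and the hadoop-aws module must be on the classpath):

import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", os.environ["INTERNAL_ACCESS_KEY"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["INTERNAL_SECRET_KEY"])
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read shares through the generic s3a connector instead of gs.
df = spark.read.json("s3a://bucket-internal-private/raw/shares/")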

Use mc (minio client) instead of gsutil

gsutil is the Google Cloud Storage command-line utility. While it does support S3 (for AWS), it does not work with MinIO. The MinIO client, on the other hand, supports the major cloud providers and provides some nice functionality on top, like being able to output JSON.

Create an examples folder

The wrapper_example should be moved into an examples folder. The purpose of the examples folder is to provide basic references for building more complex applications.

Some examples that I want to include:

  • An example of using the autogenerated lib.prio bindings. browser-test is probably a good example that uses most of the header's functions.
  • An example of using the pythonic wrapper. This could involve using numpy (vectorized data) and generating larger amounts of testing data.
  • An example of using this library asynchronously. This example should make the data-flow easier to understand. It might be useful to include asyncio and a messaging queue.
  • An example of a jupyter notebook performing analysis on encoded binary data.

The examples should fit in a single file and should include dependencies via a Pipfile.

# Run an example
$ cd examples/browser-test
$ pipenv run python browser_test.py

Add a Dockerfile to build a reproducible runtime environment

The build for this library is currently involved because of the dependencies introduced in libprio. On Fedora 27, I was able to build libprio by installing the following packages:

dnf install nss-devel msgpack-devel

A Dockerfile will also need to include build essentials like make, gcc, and python-devel.

A docker container should make building and testing less of a hassle.

`prio-processor` command fails with `pkg_resources.ContextualVersionConflict`

To reproduce:

% docker run -it mozilla/prio-processor:latest prio-processor
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 570, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 888, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 779, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (setuptools 39.2.0 (/usr/lib/python3.6/site-packages), Requirement.parse('setuptools>=40.3.0'), {'google-auth'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/prio-processor", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3095, in <module>
    @_call_aside
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3079, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3108, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 572, in _build_master
    return cls._build_from_requirements(__requires__)
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 585, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 779, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (setuptools 39.2.0 (/usr/lib/python3.6/site-packages), Requirement.parse('setuptools>=40.3.0'), {'google-auth'})

The workaround is to invoke the module directly:

docker run -it mozilla/prio-processor:latest bash -c "cd processor; python3 -m prio_processor"

installation steps don't work

Using the recommended instructions in the README (docker build -t prio .), things seem to work fine for a while until...

...
  Downloading https://files.pythonhosted.org/packages/e3/d9/d9c56deb483c4d3289a00b12046e41428be64e8236fa210111a1f57cc42d/virtualenv_clone-0.5.1-py2.py3-none-any.whl
Installing collected packages: virtualenv, enum34, typing, certifi, virtualenv-clone, pipenv
Successfully installed certifi-2019.3.9 enum34-1.1.6 pipenv-2018.11.26 typing-3.6.6 virtualenv-16.4.3 virtualenv-clone-0.5.1
Removing intermediate container 1ed0d90316a5
 ---> 36988accc720
Step 8/13 : ENV PATH="$PATH:~/.local/bin"
 ---> Running in bf77f7a13f09
Removing intermediate container bf77f7a13f09
 ---> 5bd06bb8c8ce
Step 9/13 : WORKDIR /app
 ---> Running in 8fa3943d5ed3
Removing intermediate container 8fa3943d5ed3
 ---> b381387375f0
Step 10/13 : ADD . /app
 ---> 2809c4cd10cb
Step 11/13 : RUN make
 ---> Running in 0a9b6988ec3c
cd libprio && CCFLAGS='-fPIC' scons && cd ..

scons: *** No SConstruct file found.
File "/usr/lib/python2.7/site-packages/SCons/Script/Main.py", line 924, in _main
make: *** [Makefile:2: all] Error 2
The command '/bin/sh -c make' returned a non-zero code: 2

@acmiyaguchi -- did you forget to add something to the repository? Also this would be a good excuse to make CI work. :) Currently circle is saying there are no builds.

Fix `PrioPublickey_export` functions

On the publickey-export branch, there are typemaps defined so export functions read directly into a PyString or PyByteArray.

https://github.com/acmiyaguchi/python-libprio/blob/b0705d1416e8b66d0a2b4a1dd14da52544b022cc/libprio.i#L107-L131

This typemap is unused at SWIG generation time. I ran the following command to debug the typemap search.

$ swig -python -debug-tmsearch libprio.i | less

I searched for PrioPublickey_export and found that typemap(in) unsigned char data[CURVE25519_KEY_LEN] was not on the list of typemaps.

This could be due to the order of processing: CURVE25519_KEY_LEN is defined by the C preprocessor, which probably runs after SWIG's typemap matching. I could also be using SWIG incorrectly for array types.

Wiki changes

FYI: The following changes were made to this repository's wiki:

These were made as the result of a recent automated defacement of publicly writeable wikis.

Support Python 2.7

The build process currently only supports Python 3. It can be modified to support Python 2 too.

It seems like the SWIG interface is mostly independent of the Python version. There might be some type-mappings in the SWIG interface files that are version-specific, like the ones that handle string and bytestring interop with C.

Modify BUCKET variables to support protocol prefixes

We are currently setting up the Prio servers internally in a cross-cloud deployment. The generate and process scripts will need to be modified so they do not assume locations under gs://. For example:

function rsync() {
    local server_id=$1
    local bucket=$2
    local dest=gs://${bucket}/raw/

function send_output_external() {
    # Copy data generated by a processing step into the receiving bucket of the
    # co-processing server. A _SUCCESS file is generated on a successful copy.
    : "${BUCKET_EXTERNAL_SHARED?}"   # bucket of the external server
    local output_internal=$1         # path to internal output
    local output_external=$2         # relative path to external output
    local path="gs://${BUCKET_EXTERNAL_SHARED}/${output_external}"

The documentation will also need to be updated to note that BUCKET_* variables must be prefixed with the protocol (e.g. gs://bucket-name or s3a://bucket-name).
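
The processing scripts are shell, but the same idea is shown in a minimal Python sketch below (the helper name is hypothetical): accept a fully qualified bucket URL and only fall back to a default scheme when none is given.

import os
from urllib.parse import urlparse

def bucket_url(name: str, default_scheme: str = "gs") -> str:
    # Accept e.g. gs://bucket, s3a://bucket, or a bare bucket name.
    value = os.environ[name]
    return value if urlparse(value).scheme else f"{default_scheme}://{value}"

raw_path = f"{bucket_url('BUCKET_INTERNAL_PRIVATE')}/raw/"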
