psi's Introduction

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public. License: MIT

The OpenDP Library is a modular collection of statistical algorithms that adhere to the definition of differential privacy. It can be used to build applications of privacy-preserving computations, using a number of different models of privacy. OpenDP is implemented in Rust, with bindings for easy use from Python and R.

The architecture of the OpenDP Library is based on a conceptual framework for expressing privacy-aware computations. This framework is described in the paper A Programming Framework for OpenDP.

The OpenDP Library is part of the larger OpenDP Project, a community effort to build trustworthy, open source software tools for analysis of private data. (For simplicity in these docs, when we refer to “OpenDP,” we mean just the library, not the entire project.)

Status

OpenDP is under development, and we expect to release new versions frequently, incorporating feedback and code contributions from the OpenDP Community. It's a work in progress, but it can already be used to build some applications and to prototype contributions that will expand its functionality. We welcome you to try it and look forward to feedback on the library! However, please be aware of the following limitations:

OpenDP, like all real-world software, has both known and unknown issues. If you intend to use OpenDP for a privacy-critical application, you should evaluate the impact of these issues on your use case.

More details can be found in the Limitations section of the User Guide.

Installation

Install OpenDP for Python with pip (the package installer for Python):

$ pip install opendp

Install OpenDP for R from an R session:

install.packages("opendp", repos = "https://opendp.r-universe.dev")

More information can be found in the Getting Started section of the User Guide.

Documentation

The full documentation for OpenDP is located at https://docs.opendp.org. Here are some helpful entry points:

Getting Help

If you're having problems using OpenDP, or want to submit feedback, please reach out! Here are some ways to contact us:

Contributing

OpenDP is a community effort, and we welcome your contributions to its development! If you'd like to participate, please contact us! We also have a contribution process section in the Contributor Guide.

psi's Issues

Large negative values released in histogram

A user reports large negative values inside a histogram release:

I'm playing around with the PSI tool using the online interface and the California Demographic Dataset.
I requested a data release with the following setting:
image-1
I got the following answer:

Splash Result Page

Global Values

Epsilon: 0.5

Delta: 0.000001

Beta: 0.05

Data Size (n): 1000

Variable 1: age

Histogram Releases: 4, 0, 0, 591, 900, 1029, 1021, 1146, 1063, 937, 761, 611, 496, 407, 375, 316, 216, -8933, 60, 0

The -8933 count surprises me. The way I read the parameters, Error=11.98 means that the probability of getting an absolute error greater than 11.98 is 5%. I would assume then that the probability of getting an error of at least 8933 is astronomically small. Am I misunderstanding something? Is there a bug?
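As a rough sanity check of the reporter's reasoning, the sketch below assumes each histogram count is perturbed with Laplace noise (an assumption; the issue does not say which mechanism produced this release). For Laplace noise with scale b, P(|noise| > t) = exp(-t / b), so an error bound of 11.98 at beta = 0.05 implies a particular scale, from which the tail probability at 8933 follows. The function names are illustrative only and are not part of PSI.

```python
import math

def laplace_scale_from_error(error: float, beta: float) -> float:
    """Scale b such that P(|Laplace(b)| > error) == beta."""
    return error / math.log(1.0 / beta)

def laplace_log10_tail(t: float, scale: float) -> float:
    """log10 of P(|Laplace(scale)| > t), where the tail equals exp(-t / scale)."""
    return -t / scale / math.log(10.0)

# Values taken from the report: Error = 11.98, Beta = 0.05, observed error ~ 8933.
scale = laplace_scale_from_error(11.98, 0.05)
log10_p = laplace_log10_tail(8933, scale)

print(f"implied Laplace scale: {scale:.2f}")
print(f"log10 P(|noise| >= 8933): {log10_p:.0f}")
```

Under that assumption the tail probability is many hundreds of orders of magnitude below 1, which supports the reporter's suspicion that the value is unlikely to be ordinary noise.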

Uploading Datasets

  • For utility study, need the functionality for data depositor to upload own dataset
  • Feature that allows for local instance uploads? (if data depositor does not want to share dataset with main PSI)

Add variable transformer

There is a Rook app that allows testing of proposed variable transforms. It uses Haskell. It was previously running in local mode but needs to be set up in the GCE.

The app is verifyTransform.app found in rookTransform.R here: https://github.com/TwoRavens/PSI/blob/develop/rook/rookTransform.R

The underlying code is here: https://github.com/TwoRavens/PSI/blob/88f1047c65a34e5784818376a27c43eb717665a3/rook/dpmodules/Jack/transform.R

Here is where it ought to be called, but isn't: https://github.com/TwoRavens/PSI/blob/88f1047c65a34e5784818376a27c43eb717665a3/rook/dpmodules/Jack/transform.R#L65-L68

adjust psi-library location

The psi-library changed organization from IQSS to privacytoolsproject, which will break the devtools command to install that library. Need to fix.

More feedback from Salil

  • We should prominently state on psiprivacy.org that this is still a prototype (not safe for actual use yet) and encourage feedback (with instructions on how to provide it!)

  • we should not title the section "Secrecy of the Sample" which will mean nothing to an end user. We should call it "Population Size" and use it to explain what that parameter means and the implication of using it.

  • can't see percentage under reserve budget slider (looks like it's there but below the fold and I can't find a way to scroll low enough to see it)

  • should give more suggestions for improving accuracy than releasing holds, as it may be that the total privacy budget can't even support the desired accuracy. other options are adjusting the statistic's metadata, decomposing the desired statistic into simpler ones that might be computable with better accuracy, or considering alternative statistics that would have similar usefulness but might be computable with better accuracy.

  • explicitly affirm (not just read text) that any individual's data can only affect one row.

Epsilon does not update properly with each release

In the IQ interface, when the user submits a release, the epsilon does not update properly. Example in screenshot: Total Epsilon = 1.0, Batch Epsilon = 0.1, Remaining Epsilon for next session = 0.4 (should be 0.9?)
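The expected value in the example can be checked with a minimal sketch, assuming basic (sequential) composition in which the epsilons of successive batches simply add up. The issue does not state which composition rule the interface applies, and `remaining_epsilon` is a hypothetical helper, not a function from the PSI code base. Decimal arithmetic is used so that 1.0 - 0.1 does not pick up floating-point noise.

```python
from decimal import Decimal
from typing import Iterable

def remaining_epsilon(total: str, spent: Iterable[str]) -> Decimal:
    """Remaining budget under basic composition: total minus the sum of spent epsilons."""
    total_eps = Decimal(total)
    used = sum((Decimal(s) for s in spent), Decimal("0"))
    if used > total_eps:
        raise ValueError(f"spent epsilon {used} exceeds total budget {total_eps}")
    return total_eps - used

# Total Epsilon = 1.0, one batch with Batch Epsilon = 0.1
print(remaining_epsilon("1.0", ["0.1"]))  # 0.9
```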

Screen Shot 2019-07-24 at 2 21 14 PM

User profile/budget/metadata storage

depends on #4

This will eventually be several issues. Reference issue from the fall:

  • IQSS/dataverse#4063
    • Read the issue + 3 related issues
    • For a prototype, the privacy preserving metadata will be stored in this tool, not Dataverse

  1. Design with James/Jack what should be saved, how it should be saved, etc. This may include:
    • budgets per dataset
    • query results, e.g. summary metadata (JSON)
    • etc
  2. API endpoints for saving/retrieving data

Redesigning budgeter workflow

Adding a progress bar to indicate progress through initial series of modals.

Considering alternatives to using modals.

Soliciting and incorporating feedback from users.

branch setup / docker hub / travis build

  • travis.com setup (note: differs from travis.org)
  • test created docker images
  • add rookconfig.R with env variables to include
    • production flag
    • shared data directory
  • add rookconfig.R to apps to avoid redundant settings
  • Dockerfile
    • set production flag
    • set shared data directory

update deployment

  • placeholder for any deployment-related changes
  • put in a periodic health check and automatically relaunch when down.

PSI preparation feedback

This list is from Salil's 10/15 notes.

  • Functioning epsilon and delta remain blank when entering secrecy of sample
  • Metadata
  • Privacy loss parameters: should let them select from the 5 options, set of options, or enter their own
  • Should require an active choice of privacy parameters (can't just close the window) unless clearly chosen to be in an "experimental" mode.
  • Window to give bin names comes up even if no categorical variables
  • Need to specify what is coding of boolean variables
  • Need to specify the syntax for entering bins (comma separated?). Should also allow unbounded/infinite bin space. And should include instructions for the shorthands like A:E.
  • Exponential notation for delta should be the default.
  • Are granularity and number of bins capturing different things for a numerical variable? The metadata for all statistics on a single variable shouldn't be forced to be the same; these are parameters for the statistics, not about the variable. Should have per-statistic metadata, not just per-variable. The per-variable ranges should be used as the defaults for the per-statistic ranges, but one should be able to edit them for the particular statistics.
  • When I enter something it's not clear what I need to do to complete the entry of metadata for a statistic - I just guessed to click on a random spot on the window. Feels like there should be a "done" button.
  • The metadata warning (about not using the actual ranges) feels out of place on the statistics entry screen, it really should show up only when one is required to enter statistics, not when they are prepopulated. In that case, it can appear with the help button or maybe automatically when they start to edit a range.
  • No place to edit formula for a transformed variable or remove a transformed variable once it's been created.
  • When I defined ratio <- income/education, it classified ratio as categorical rather than numerical.
  • Need a way to be able to delete a transformed variable.
  • [-] When I switch to a new variable, the accordions for the other variables should get automatically minimized.
  • Multivariate statistics have range values empty again even though they've already been entered at the start.
  • When Data Ranges page comes up again after creating a transformed variable, the ranges for the other variables are all blank even though they've already been entered. Better to only ask for a range for the new variable.
  • "objective function for the regression" has no scale
  • The error measure for quantiles should be normalized by n. Then it has a natural interpretation. If the error measure is .1, then what we report as a median will actually be between the 49th and 51st percentile.
  • Edit parameters should bring up the same text as in the page when epsilon and delta were originally set (interpretations of epsilon). Also should include a link to primer section on interpretations of epsilon.
  • Error bounds for means look too large to me if n=500,000 as I recall. What is the formula that's being used?
  • We need to warn them about consequences of entering overly large or small ranges.
  • With probability at least .95 in help text for error
  • ? Delete variable didn't actually delete the statistic.
  • We should show n, the number of records, somewhere.

footer arrangement

The current footer looks like:

screen shot 2018-10-03 at 1 19 00 pm

To save vertical space, should right justify the submit button under the stats summary table, and maybe left justify the show table button.

Allow system to use google cloud storage (buckets)

Persistent file storage will be needed, including temporarily storing source data as well as longer term storage of metadata. Options include:

When deployed via k8s on gce, any files downloaded or created should use this persistent storage

What scenarios need to be handled?

  • Download dataverse metadata as well as source files and save them to a bucket
  • Upload a source file and save it to a bucket
  • When appropriate, pass references to storage objects between the containers (python, R)
  • ???

create develop branch; deploy from master

  • Make a develop branch which becomes the default working branch.
  • update .travis to reflect 3 tags:
    • develop branch -> latest tag
    • master branch -> master tag and deploy tag
  • update k8s deploy to use containers tagged master
  • The master branch would be used for deploying to a demo version on Google Compute Engine or similar.

remove VGAM from the R code

Although not explicitly loaded by rookSetup.R, the current codebase uses VGAM.

-> reduce, remove VGAM use

Fix help modals and help text

Opening a help modal incorrectly regenerates the entire interface, with unexpected and chaotic results.

The help text that used to appear in the page header when clicking on specific text for more detailed instructions no longer appears and needs to be brought back.

Variable type modal window ignores newly constructed variables

The variable transformer functionality creates new variables. They are currently created with type unidentified, so the type has to be set in the type-setting modal window. However, that window populates its variable list from the original variable list, not the current one.

rookSetup appList bug

Running rookSetup.R occasionally gives error:

Error in appList[[i]] : subscript out of bounds

Production toggles in R code

There are some remaining production toggles in R code. The primary issue is that screen output has no location to go when on the server, and particularly in R Apache this causes significant issues, so a sink is set up for output in production.

Any of these that remain should be updated to become argument calls.

Toggle between data sets

  1. On login, user is prompted with data sets that are available to work on

  2. When user exits, changes they made to the data set should be saved for them to come back to later

Usability Testing Notes

Dataverse Usability Testing notes from Jayshree and Sophia

Recommendations

Overall

  • "disable" -> "x" similar to browser window with minimize/close
  • disable numeric epsilon/delta in the startup window or make more hidden
  • a user misinterpreted more error to be better (use color to indicate if it is a useful release, but don't scare them about leaking data)

Users hesitant to use privacy budget

  • (provide a public demo dataset anyone can access)
  • embedded tutorial videos

MIKE IDEA: replace left panel variable list with collapsible box for each variable

Budgeter

  • Privacy gains for population size not clearly shown to user/emphasized
    (maybe move to its own step in the popup)
  • include one regular and one multivariate statistic on center panel on load
  • preset workspaces
  • for missing values: add a drop option (listwise deletion)
    IDEA: similar barchart to interactive, but break up by statistic/release

Explain drawbacks of range (include in prompt), assurance on DP regardless of range

  • too small of a range is biased
  • too large is high variance
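The two drawbacks can be illustrated with a minimal sketch of a differentially private mean. It assumes the data are clamped to [lower, upper], that n is known, and that Laplace noise with scale (upper - lower) / (n * epsilon) is added; these are assumptions for illustration and not necessarily the formula PSI uses. The sample values and the helper name are made up.

```python
def clamped_mean_tradeoff(data, lower, upper, epsilon):
    """Return (bias from clamping, Laplace noise scale) for a DP mean on [lower, upper]."""
    n = len(data)
    clamped = [min(max(x, lower), upper) for x in data]
    true_mean = sum(data) / n
    bias = sum(clamped) / n - true_mean          # nonzero when the range cuts off real values
    noise_scale = (upper - lower) / (n * epsilon)  # grows linearly with the range width
    return bias, noise_scale

ages = [23, 35, 41, 58, 62, 77, 84, 90]  # made-up sample
for lower, upper in [(30, 60), (0, 100), (0, 10_000)]:
    bias, scale = clamped_mean_tradeoff(ages, lower, upper, epsilon=0.5)
    print(f"range=({lower}, {upper})  bias={bias:+.2f}  noise scale={scale:.1f}")
```

The privacy guarantee holds for any of the three ranges; only the accuracy changes. The narrow range biases the mean, while the very wide range leaves the mean unbiased but inflates the noise.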

Interactive

  • ranges and variable types should be set by depositor (should be visible, not editable)
  • show starting epsilon (no longer visible after a release)
  • "title" html attribute for sections of the bar graph thing, showing the epsilon of the section
  • lots of bugs

Route rook calls through django

Remove alert when variable has no type information

Variables sometimes have no type information. This can be an error if the variable is present in the original file but no estimate is in the metadata; however, even this error is correctly fixable in the current workflow. Moreover, when new variables are created by transforms, this will always (correctly) occur. The alert is therefore confusing to the user and should be removed.

add project contributor readme

  • set master branch--the deployed branch
  • work off a develop branch
  • issues + github board
  • creating branches based on issue #s
  • merge back
    • tests + possible code review
    • do we want 2 or 3 gatekeepers who can review pull requests?
      • probably yes

Add django to the project

Initial integration step: Have django serve the html and css with UI calls still going directly to rook.



There will be a fair amount of code/path re-working here, including:

  1. Adding a top-level "assets" directory to this repository for the js/css/images.
    • These can be served via django's static files: https://docs.djangoproject.com/en/2.0/howto/static-files/
      - (At a later stage, if a .js framework is used, this folder can be used by webpack)
    • Several libraries will need to be added here, including the css/js/images under the UI directory. Examples:
      • UI/code/libraries/
      • UI/controls/
      • UI/css/
      • UI/images/
      • UI/lib/
      • .css, .js, files
      • etc., etc.
  2. Adding a top-level "templates" directory to this repository for the html files. These files can be served via django views.
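The two directories above map onto standard Django settings. The sketch below shows one way they might be wired up; `STATIC_URL`, `STATICFILES_DIRS`, and `TEMPLATES` are real Django setting names, while the helper function and the example path are hypothetical. In an actual settings.py these would be module-level assignments; they are wrapped in a function here only so the snippet runs without Django installed.

```python
from pathlib import Path

def psi_static_and_template_settings(base_dir: Path) -> dict:
    """Build the Django settings that point at top-level assets/ and templates/ directories."""
    return {
        # URL prefix under which the js/css/images are served
        "STATIC_URL": "/static/",
        # Extra directories searched by django.contrib.staticfiles
        "STATICFILES_DIRS": [base_dir / "assets"],
        "TEMPLATES": [
            {
                "BACKEND": "django.template.backends.django.DjangoTemplates",
                # Top-level directory holding the html files rendered by views
                "DIRS": [base_dir / "templates"],
                "APP_DIRS": True,
                "OPTIONS": {"context_processors": []},
            }
        ],
    }

settings = psi_static_and_template_settings(Path("/srv/psi"))  # hypothetical repo root
print(settings["STATICFILES_DIRS"], settings["TEMPLATES"][0]["DIRS"])
```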

Notes from Sophia’s talk

  • When you hit the red question mark next to a statistic, don’t use the word “true” (e.g. “true mean”). Instead, use “sample mean”, “dataset mean” or “mean of the dataset”

  • Move blue metadata warning text to early modal screens instead. Include drawbacks of incorrect ranges (too small may cause bias, too large causes high variance)

  • Put somewhere in writing that nothing you do in budgeter is final until you hit submit

  • Reserve budget slider behaves weirdly if you slide it farther to the right than you can afford (if a statistic is being held)

  • Fix + sign in transformer

  • Tutorial should pop up automatically in user trials

Open for discussion:

  • Potentially add standard deviation

  • Default to hiding privacy parameters for analysts. For depositors too?

  • Should analysts see the same opening modals as depositors?

Storing complete metadata in JSON file

Data for releases is currently written by library functions into a dpRelease R object. These functions will be updated to ensure all desired metadata exists in the object, which then needs to be stored in the JSON file according to the defined schema.

Move Rook apps to Flask

Rook has limited support and a small user base, and has problems with large arguments from large metadata. It is likely hard to support in future, and occasionally difficult to work with at present. Move to Flask-wrapped R functions.

psiprivacy.org; use external db

The current test version uses a sqlite db on the container, so it is deleted each time k8s is restarted.

Once users are added (#3) and dataverse files are used (#23), convert to a persistent sql database.
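One possible shape for that switch is sketched below: the Django `DATABASES` setting is chosen from environment variables, falling back to the container-local sqlite file when no external host is configured. The issue does not name a database engine; PostgreSQL is used here purely as an assumption, and the `PSI_DB_*` variable names and the helper function are hypothetical.

```python
import os
from typing import Mapping

def database_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return a Django DATABASES dict: external db if PSI_DB_HOST is set, else local sqlite."""
    host = env.get("PSI_DB_HOST")
    if not host:
        # Container-local file; lost whenever the pod is restarted
        return {"default": {"ENGINE": "django.db.backends.sqlite3", "NAME": "db.sqlite3"}}
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",  # assumed engine
            "NAME": env.get("PSI_DB_NAME", "psi"),
            "USER": env.get("PSI_DB_USER", "psi"),
            "PASSWORD": env.get("PSI_DB_PASSWORD", ""),
            "HOST": host,
            "PORT": env.get("PSI_DB_PORT", "5432"),
        }
    }

print(database_settings({})["default"]["ENGINE"])
print(database_settings({"PSI_DB_HOST": "db.example.internal"})["default"]["ENGINE"])
```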

add external, static files directly to project

Issue: Trying to run PSI locally/offline and it failed.

Download/add these files to the project instead of cdn/external usage

<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
<script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.10.3/jquery-ui.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
<script type="text/javascript" src="/static/lib/d3.v3.min.js"></script>

<link rel="stylesheet" type="text/css" href="https://rawgit.com/mitsos1os/bootstrap-slider/master/dist/css/bootstrap-slider.min.css">
<script type="text/javascript" src="https://rawgit.com/mitsos1os/bootstrap-slider/master/dist/bootstrap-slider.min.js"></script>

Lose statistics if you ask for error too small

If you ask for an error so small that it can not be satisfied (and get the warning message) you then (sometimes?) lose a statistic from the table. Not necessarily the statistic that you just tried to change. Here I tried to change mean(age) and lost histogram(age):

screen shot 2019-03-04 at 5 02 42 am
