uber-archive / sql-differential-privacy Goto Github PK

View Code? Open in Web Editor NEW

392.0 31.0 73.0 228 KB

Dataflow analysis & differential privacy for SQL queries. This project is deprecated and not maintained.

License: MIT License

Scala 100.00%

sql

sql-differential-privacy's Introduction

Overview

(This project is deprecated and not maintained.)

This repository contains a query analysis and rewriting framework to enforce differential privacy for general-purpose SQL queries. The rewriting engine can automatically transform an input query into an intrinsically private query which embeds a differential privacy mechanism in the query directly; the transformed query enforces differential privacy on its results and can be executed on any standard SQL database. This approach supports many state-of-the-art differential privacy mechanisms; the code currently includes rewriters based on Elastic Sensitivity and Sample and Aggregate, and more will be added soon.

The rewriting framework is built on a robust dataflow analyses engine for SQL queries. This framework provides an abstract representation of queries, plus several kinds of built-in dataflow analyses tailored to this representation. This framework can be used to implement other types of dataflow analyses, as described below.

Building & Running

This framework is written in Scala and built using Maven. The code has been tested on Mac OS X and Linux. To build the code:

$ mvn package

Example: Query Rewriting

The file examples/QueryRewritingExample.scala contains sample code for query rewriting and demonstrates the supported mechanisms using a few simple queries. To run this example:

mvn exec:java -Dexec.mainClass="examples.QueryRewritingExample"

This example code can be easily modified, e.g., to test different queries or change parameter values.

Background: Elastic Sensitivity

Elastic sensitivity is an approach for efficiently approximating the local sensitivity of a query, which can be used to enforce differential privacy for the query. The approach requires only a static analysis of the query and therefore imposes minimal performance overhead. Importantly, it does not require any changes to the database. Details of the approach are available in this paper.

Elastic sensitivity can be used to determine the scale of random noise necessary to make the results of a query differentially private. For a given output column of a query with elastic sensitivity s, to achieve differential privacy for that column it suffices to smooth s according to the smooth sensitivity approach to obtain S, then add random noise drawn from the Laplace distribution, scaled to (S/epsilon) and centered at 0, to the true result of the query. The smoothing can be accomplished using the smooth sensitivity approach introduced by Nissim et al.

The file examples.ElasticSensitivityExample contains code demonstrating this approach directly (i.e., applying noise manually rather than generating an intrinsically private query).

To run this example:

mvn exec:java -Dexec.mainClass="examples.ElasticSensitivityExample"

Analysis Framework

This framework can perform additional analyses on SQL queries, and can be extended with new analyses. Each analysis in this framework extends the base class com.uber.engsec.dp.sql.AbstractAnalysis.

To run an analysis on a query, call the method com.uber.engsec.dp.sql.AbstractAnalysis.analyzeQuery. The parameter of this method is a string containing a SQL query, and its return value is an abstract domain representing the results of the analysis.

The source code includes several example analyses to demonstrate features of the framework. The simplest example is com.uber.engsec.dp.analysis.taint.TaintAnalysis, which returns an abstract domain containing information about which output columns of the query might contain data flowing from "tainted" columns in the database. The database schema determines which columns are tainted. You can invoke this analysis as follows:

scala> (new com.uber.engsec.dp.analysis.taint.TaintAnalysis).analyzeQuery("SELECT my_col1 FROM my_table")
BooleanDomain = my_col1 -> False

This code includes several built-in analyses, including:

The elastic sensitivity analysis, available in com.uber.engsec.dp.analysis.differential_privacy.ElasticSensitivityAnalysis, returns an abstract domain (com.uber.engsec.dp.analysis.differential_privacy.SensitivityDomain) that maps each output column of the query to its elastic sensitivity.
com.uber.engsec.dp.analysis.columns_used.ColumnsUsedAnalysis lists the original database columns from which the results of each output column are computed.
com.uber.engsec.dp.analysis.histogram.HistogramAnalysis lists the aggregation-ness of each output column of the query (i.e. whether or not the output is an aggregation, and if so, which type).
com.uber.engsec.dp.analysis.join.JoinKeysUsed lists the original database columns used as equijoin keys for each output column of the query.

Writing New Analyses

New analyses can be implemented by extending one of the abstract analysis classes and implementing transfer functions which describe how to update the analysis state for relevant query constructs. Analyses are written to update a specific type of abstract domain which represents the current state of the analysis. Each abstract domain type implements the trait com.uber.engsec.dp.dataflow.AbstractDomain.

The simplest way to implement a new analysis is to use com.uber.engsec.dp.dataflow.dp.column.AbstractColumnAnalysis, which automatically tracks analysis state for each column of the query independently. Most of the example analyses are of this type.

New analyses can be invoked in the same way as the built-in example analyses.

Reporting Security Bugs

Please report security bugs through HackerOne.

License

This project is released under the MIT License.

Contact Information

This project is developed and maintained by Noah Johnson and Joe Near.

sql-differential-privacy's People

Contributors

Stargazers

Watchers

sql-differential-privacy's Issues

No suitable driver found for jdbc:calcite

I packaged the application by using maven-assembly-plugin. However, when I try to run it with java -jar sql-differential-privacy-0.1.2-SNAPSHOT-jar-with-dependencies.jar it throws the error as follows:

Exception in thread "main" Exception in thread "Timer-0" java.lang.RuntimeException: java.sql.SQLException: No suitable driver found for jdbc:calcite:
        at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:159)
        at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:114)
        Caused by: java.sql.SQLException: No suitable driver found for jdbc:calcite:
        at java.sql.DriverManager.getConnection(Unknown Source)
        at java.sql.DriverManager.getConnection(Unknown Source)
        at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:153)
        ... 17 more
java.lang.RuntimeException: java.sql.SQLException: No suitable driver found for jdbc:calcite:
        at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:159)
        at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:114)
        at java.util.TimerThread.mainLoop(Unknown Source)
        at java.util.TimerThread.run(Unknown Source)
Caused by: java.sql.SQLException: No suitable driver found for jdbc:calcite:
        at java.sql.DriverManager.getConnection(Unknown Source)
        at java.sql.DriverManager.getConnection(Unknown Source)
        at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:153)
        ... 8 more

I found this link regarding this issue but I'm not sure what might be a workaround for this.
Any suggestions?

Thanks!

Sample and Aggregate Rewritten Queries

I'm using FLEX to create a simple demo of differentially private queries with a real hive database running the rewritten queries. The problem I'm having is that the rewritten sample and aggregate queries do not execute on my hive database. Hive does not like the underscores in front of variables and the use of UNION in the rewritten query while parsing. Are there any databases that are confirmed to work with all rewritten queries? Perhaps I'm using a newer version of hive that's incompatible with this build of FLEX? Any help or pointers would be nice.

Tracker Attack Question

Unless I'm misunderstand something, re-writing the query instead of creating a differentially private dataset allows for a tracker attack, repeatedly invoking a query that returns a small dataset and then using aggregate queries against that collection of small datasets to glean identifying information. Are there any best practices when using this library to mitigate that risk?

High Noise

Hello,
When I ran the Elastic Sensitivity example, I found the results are very noisy. I got around 9% of noise. Why is that?

Is the test DB available?

Is the "test" DB available for download? I was not able to find it in the repository. Thanks.

Differential privacy of example.

I may be misunderstanding, but the com.uber.engsec.dp.util.DPExample example you have doesn't seem to correspond to differential privacy.

object DPExample extends App {
  // Use the table schemas and metadata defined by the test classes
  System.setProperty("schema.config.path", "src/test/resources/schema.yaml")

  // example query: How many US customers ordered product #1?
  val query = """
    SELECT COUNT(*) FROM orders
    JOIN customers ON orders.customer_id = customers.customer_id
    WHERE orders.product_id = 1 AND customers.address LIKE '%United States%'
  """

  // query result when executed on the database
  val QUERY_RESULT = 100000

  // privacy budget
  val EPSILON = 0.1

  println(s"Query: $query")
  println(s"Private result: $QUERY_RESULT\n")

  (1 to 10).foreach { i =>
    val noisyResult = DPUtils.addNoise(query, QUERY_RESULT, EPSILON)
    println(s"Noisy result (run $i): %.0f".format(noisyResult))
  }
}

Obviously, it is hard to tell because QUERY_RESULT is just fabricated here, but if we are expected to view this as "run the query; imagine you got these results", it does not appear to have the property of differential privacy.

It looks like you are assuming that orders.customer_id has maximum multiplicity one. Otherwise I could have a customers relation with one customer, FRANK, and an orders relation with 100,000 orders for each of my collection of Ted Cruz sex dolls. The reported results, all of which are tightly concentrated around 100,000 would reveal that I am present in the customers relation, whereas had I withheld my record from customers the answers would be tightly concentrated around zero.

I'm not sure how the rest of your pipeline works, but perhaps you could clean up the example or explain the rationale for why it has differential privacy for the stated query.

EDIT: Oho! You do have assumptions about the data, laid out in https://github.com/uber/sql-differential-privacy/blob/master/src/test/resources/schema.yaml. This would be super helpful to point out! :D What happens when these assumptions are violated?

EDIT2: It looks like in the orders relation the customer_id field has maxFreq: 100. So let's assume I order 100 Ted Cruz blow-up dolls. My presence or absence in customers will change the output by +/-100, which it looks like would not be masked by your noise addition, which when I run produces results like

Query: 
    SELECT COUNT(*) FROM orders
    JOIN customers ON orders.customer_id = customers.customer_id
    WHERE orders.product_id = 1 AND customers.address LIKE '%United States%'
  
Private result: 100000

Noisy result (run 1): 100006
Noisy result (run 2): 100008
Noisy result (run 3): 100000
Noisy result (run 4): 99999
Noisy result (run 5): 100001
Noisy result (run 6): 99999
Noisy result (run 7): 100007
Noisy result (run 8): 99992
Noisy result (run 9): 100018
Noisy result (run 10): 100002

These seem to be clearly 100,000 and certainly not 99,900, which is what the result would be without my hypothetical record. What am I missing?

rewrite DP queries for SQL Server

The query rewriter on the example generated queries that are compatible with Vertica Database.
I know you mentioned in your paper and this repo that the query generated by this method (elastic sensitivity and FLEX) can be executed against any SQL database.
For example the Generated query in vertica is:

SELECT COUNT(*) + 20.0 * (CASE WHEN RAND() - 0.5 < 0 THEN -1.0 ELSE 1.0 END * LN(1 - 2 * ABS(RAND() - 0.5)))
  FROM (SELECT timezoneOffset
  FROM TSMC2014) t
  WHERE timezoneOffset = '540'

the SQL Server version of this would be (Difference in LN vs. LOG):

SELECT COUNT(*) + 20.0 * (CASE WHEN RAND() - 0.5 < 0 THEN -1.0 ELSE 1.0 END * LOG(1 - 2 * ABS(RAND() - 0.5)))
  FROM (SELECT timezoneOffset
  FROM TSMC2014) t
  WHERE timezoneOffset = '540'

similar for MOD(n,m) in vertical which is equivalent to n % m in SQL Server and perhaps there are more functions that are not compatible.

How can we get this working with SQL Server? In other words what is the most optimized way to make this code compatible with other SQL Databases (e.g. SQL Server)
Thanks