diffix / pg_diffix Goto Github PK

View Code? Open in Web Editor NEW

17.0 3.0 1.0 1.14 MB

Implementation of the Open Diffix anonymization mechanism for PostgreSQL.

Home Page: https://www.open-diffix.org

License: MIT License

Makefile 0.32% C 91.53% JavaScript 1.10% Dockerfile 0.42% Shell 0.48% PLpgSQL 6.04% Batchfile 0.12%

anonymization anonymous anonymize anonymisation anonymity privacy-protection privacy-tools

pg_diffix's Introduction

PG Diffix

pg_diffix is a PostgreSQL extension for strong dynamic anonymization. It ensures that answers to simple SQL queries are anonymous. For more information, visit the Open Diffix website.

For administrators: Check out the admin tutorial for an example on how to set up pg_diffix. See the admin guide for details on configuring and using the extension. To install from source, see the installation section.

For analysts: The banking notebook provides example queries against a real dataset. The analyst guide describes the SQL features and limitations imposed by pg_diffix.

Installation

PostgreSQL version 13 or higher is required.

Linux

You need make, jq, and a recent C compiler. You should already have the postgresql-server-dev-x package installed if you have PostgreSQL version x. If not, you must install it in order to compile the source.

The source is compiled with: make (or make TARGET=release for release version). The compiled extension is installed with: make install (which requires superuser permissions).

The extension is also available on PGXN, and can be installed using PGXN Client.

Windows

You need Visual Studio 2019 (or greater) installed with the "Desktop development with C++" option selected. You also need to set the environment variable PGROOT to point to the location of the PostgreSQL installation.

You can compile the source from inside VS, by opening the provided solution file, or from the command line, by opening a "Developer Command Prompt for VS 20XX" terminal in the project's folder and executing msbuild (to do a release build, execute msbuild -p:Configuration=Release, to clean the build files, run msbuild -t:Clean).

The compiled extension is installed by running install (for debug version) or install Release (for release version).

Activating the extension

You can set up the extension for the current database by using the command CREATE EXTENSION pg_diffix;.

To properly enforce the anonymization restrictions, the extension has to be automatically loaded on every session start for restricted users. This can be accomplished by configuring library preloading.

For example, to automatically load the pg_diffix extension for all users connecting to a database, you can execute the following command:

ALTER DATABASE db_name SET session_preload_libraries TO 'pg_diffix';

Once loaded, the extension logs information to /var/log/postgresql/postgresql-13-main.log or equivalent.

Node dumps can be formatted to readable form by using pg_node_formatter.

Deactivating the extension

You can drop the extension from the current database by using the command DROP EXTENSION pg_diffix;.

You might also need to remove the extension from the list of preloaded libraries.

For example, to reset the list of preloaded libraries for a database, you can execute the following command:

ALTER DATABASE db_name SET session_preload_libraries TO DEFAULT;

Testing the extension

`make installcheck`

Once you have a running server with the extension installed, execute make installcheck to run the tests. You must ensure you have all the required permissions for this to succeed, for example:

In your pg_hba.conf your PostgreSQL superuser to have trust authentication METHOD. If modified, run systemctl restart postgresql.service to apply changes.
Invoke using PGUSER=<postgres-superuser> make installcheck

or if available, just make your usual PostgreSQL user a SUPERUSER.

`PGXN Test Tools`

Or you can use the PGXN Extension Build and Test Tools Docker image:

docker run -it --rm --mount "type=bind,src=$(pwd),dst=/repo" pgxn/pgxn-tools sh -c \
  'cd /repo && apt update && apt install -y jq && pg-start 13 && pg-build-test'

Docker images

We provide 2 Docker images preconfigured with the extension.

Base image

The base image is a standard postgres image with pg_diffix installed and preloaded. It does not include any additional database or user out of the box.

The example below shows how to build the image and run a minimally configured container.

Build the image:

make image

Run the container in foreground and expose in port 10432:

docker run --rm --name pg_diffix -e POSTGRES_PASSWORD=postgres -p 10432:5432 pg_diffix

From another shell you can connect to the container via psql:

psql -h localhost -p 10432 -d postgres -U postgres

For more advanced usage see the official image reference.

Demo image

The demo image extends the base image with a sample dataset and a user for each access level.

Once started, the container creates and populates the banking database. Three users are created, all of them with password demo:

trusted_user with anonymized access to banking in trusted mode
untrusted_user with anonymized access to banking in untrusted mode
direct_user with direct (non-anonymized) access to banking

NOTE The required file docker/demo/01-banking-data.sql is managed by Git LFS.

Build the image:

make demo-image

Run the container in foreground and expose in port 10432:

docker run --rm --name pg_diffix_demo -e POSTGRES_PASSWORD=postgres -e BANKING_PASSWORD=demo -p 10432:5432 pg_diffix_demo

Connect to the banking database (from another shell) for anonymized access:

psql -h localhost -p 10432 -d banking -U trusted_user

To keep the container running you can start it in detached mode and with a restart policy:

docker run -d --name pg_diffix_demo --restart unless-stopped \
  -e POSTGRES_PASSWORD=postgres -e BANKING_PASSWORD=demo -p 10432:5432 pg_diffix_demo

pg_diffix's People

Contributors

Stargazers

Watchers

Forkers

nbs2904

pg_diffix's Issues

Generalize AIDs to single uint64 identifier

We leave int4 and in8 as they are. For strings we use their hash as an AID.

We can have the first parameter of aggregates as ANY, then figure out their type during state initialization.

This will greatly simplify the aggregates.

Can we safely load this as a shared_preload_library?

If not, we can check process_shared_preload_libraries_in_progress and fail early.

Implement contribution spec for sum/avg

Create a generic subquery unroller

Once we have implicit counts we need to unroll them to one row per noisy count.

I suggest adding the count as a junk (or maybe regular) column, then nest this into a subquery and do some cross join magic to unroll rows.

We rewrite this in the post_parse_analyze hook.

Alternatively, SRF (set-returning functions) could be used.

Varchar is not accepted as AID type

ERROR:  Unsupported AID type (OID 1043)

Determine whether security labels are a viable approach to specifying AIDs

https://www.postgresql.org/docs/current/sql-security-label.html

Related: #6

Scale aggregation noise SD with top group average.

Should we get the same seed for different aggregates?

Currently aggregates start from aid_seed=0, meaning the same set of AIDs will produce the same RNG, even for different types of aggregates.

Should each of lcf, count(*), count(col), count(distinct aid) have a different starting seed so that the final hash and therefore the RNG is different?

Watch for relcache invalidations

If we cache config table OIDs we need to CacheRegisterRelcacheCallback

Create a Dockerfile with a Postgres preconfigured to use the extension

This is a continuation of the previous issue: #1

We have install instructions for how to get the extension running, but going forward it would be wonderful to have a Dockerfile where the extension is built, installed, and loaded for you.

Install instructions

@edongashi, how do you get this extension installed for testing?
What would be super neat might maybe be to have a Dockerfile in which the latest version is compiled and installed?
Would be nice for quick testing for someone who doesn't have the full toolchain installed locally.

Make LCF multi-AID aware.

Create a catalog table as part of the extension for configuration

Tables, AIDs, and other relation config need to be managed dynamically. A config table should be created to model this information.

Write SQL tests for new aggregates

Rename repository

We need to move this repository before making it public

Support count(distinct aid)

This should be similar to diffix_lcf aggregate found in pg_diffix/template/agg_lcf_a.h. The only difference would be the final function. In this case it would return the noisy uniq count instead of true/false.

Write non-template contribution tracker

Current idea is an untagged union of double / int64 and a struct with functions.

References/notes

We can register custom plan nodes with RegisterExtensibleNodeMethods [1] [2]

We can register custom scan methods with RegisterCustomScanMethods [1]

We can watch for reloid invalidations with CacheRegisterRelcacheCallback

We can manage extension assets in a separate schema [1]

We can check process_shared_preload_libraries_in_progress during init.

We probably need to check IsBinaryUpgrade & RecoveryInProgress() during init.

We can create our cache memory context with AllocSetContextCreate / MemoryContextResetAndDeleteChildren

PGXN, the PostgreSQL Extension network, is a central distribution system for open-source PostgreSQL extension libraries. https://pgxn.org/

How to scan a table [1] [2] [3]

Tuplestore allows caching of tuples [1]

#if (PG_VERSION_NUM < 120000)
#define table_open(r, l) heap_open(r, l)
#define table_close(r, l) heap_close(r, l)
#endif

Create config cache and MemoryContext

AllocSetContextCreate / MemoryContextResetAndDeleteChildren

Consider creating custom schema for extension exports

Example: https://github.com/citusdata/pg_cron/blob/main/pg_cron.sql

Provide docs how to label tables and roles

We need some reference what labels apply to what.

Initial configuration model for anonymization settings.

Currently, custom variables are used to set the anonymization parameters of the system, e.g.:

SET pg_diffix.default_access_level = 'publish';
SET pg_diffix.noise_seed = 'secret_seed';
SET pg_diffix.minimum_allowed_aids = 3;

I did some reading on this topic and thought about alternative solutions, and, in the end, I think the initial approach here was the correct one.
The Postgres config file recognizes custom variables, so the default values we provide can be changed at server start-up.
Using a configuration table would allow us to persistently change the settings dynamically during run-time, but I don't think that is something we'll need. It would also be more complex.

I suggest to keep using custom variables to configure the anonymization settings of the system.
Later, we can add support for different anonymization profiles, like publish, cloak or knox, and then use security labels to assign a profile to a user.

There is also the question if we should allow setting different anonymization parameters per table or per column. That should be possible using security labels, but it would add significant complexity to the system (in implementation and usage) and I don't think it makes sense to worry about it at this stage. We might never need it.

Censor EXPLAIN for anon queries

Maybe with a hook like https://github.com/AmatanHead/psql-hooks/blob/master/Detailed.md#ExplainOneQuery_hook

First we have to determine whether EXPLAIN is a security risk

Handle insufficient AID cases during aggregation.

Write developer instructions on building artifacts

Is the decision for merging AIDs to int8 definitive?

I'm writing the int8 AID tracker, but I'm getting some second thoughts. Are we sure that we benefit enough to merge AIDs to int8?

We can get rid of the macros in other ways.
AID can be an untagged union and some function pointers can handle the rest. We determine the functions when we create the AID/contribution tracker by examining types sent to the aggregate.

By squeezing AIDs into the same shape we're losing bijection (for strings). We also lose the nice explain features which could be a debugging and teaching utility.

Thoughts?

Spec out and implement access control

Some users need to have raw access and some users need to have anon access.

How to handle `null`-AIDs

This is branching off from: #21

In his comment, @yoid2000 asked some questions. I'll add a comment to his question #2 here.

Regarding 2, can we avoid this question by always knowing what the actual AID is?

The AID would be null if the value is missing in the dataset, rather than as an artefact of the query itself. So it will in fact not always be possible to know what an AID is. In Aircloak Insights we explicitly filtered out the data where there was no AID value present.

In practice, this will lead to severe data loss for certain datasets (I saw this a couple of times with Aircloak insights), but in all the instances that I encountered the problem would not have been solved by including this data. Had it all been represented by a null-AID it would have been suppressed as an extreme outlier. Had it been assigning random AID values in place of the null-value, then we would have lost all ability to make any claims about the resulting anonymity. This all leads me to think we must drop null-AIDed data in Open Diffix too.

Make `count` aggregators multi-AID aware.

Do we include NULL contributors in AID seed?

If an AID contributes with NULL, do we include that AID in the seed material?

Provide instructions how to automatically preload the extension

We need preloading because otherwise hooks won't run and you can query sensitive tables.

We need to research the easiest way to do this and provide simple instructions in the README.

Rewrite count to diffix_count

Swap count(*) and count(any) aggregates with the respective diffix_count versions. Aggregate OIDs can be looked up in OidCache found in "pg_diffix/oid_cache.h".

Define DEBUG token externally

Currently DEBUG is hardcoded here https://github.com/diffix/extension-prototype/blob/65e94fc021a066e05c36ff9d0733a1d282df473d/pg_diffix/utils.h#L4

We need to make it so the build process determines whether we're making a dev or prod build. Ideally an env variable should be used to indicate this.

Write non crashing operators as UDF/UDO

https://www.postgresql.org/docs/current/xoper.html

Write C level udf/operators (mark as immutable)
Get their OIDs
Rewrite operators to UDF/UDO

I think operators is better for optimization.

Figure out whether OIDs from pg_type_d are backwards compatible

Header catalog/pg_type_d.h contains OIDs for basic types. We use this to identify types such as any, int4, text, and so on.

If this is incompatible across versions then we need to dynamically fetch these from pg_type.

Implement contribution spec for min/max

Scan diffix_config like a systable

Maybe the following is more correct:

https://github.com/citusdata/pg_cron/blob/1ac5c0f45baf1cabafbae7de1f52a030f2ec823a/src/job_metadata.c#L635-L705

Reject unsupported aggregates during anon query validation

verify_anonymization_requirements in validation.c must be extended to check for currently unsupported aggregates such as SUM, MAX, etc. I think whitelisting is better than blacklisting.

To know which aggregates to allow we can check OidCache.postgres.

Resolve type OID of configured relations' AIDs

Prerequisite for #11 and #12. We need the type of AID to know which aggregate overload to inject.

Currently we only support text and int4 AIDs.

Test extension against previous postgres versions and determine min. compatible version

SET parameter command is not persistent

If we change custom variables using SET x = y, these changes apply to the current session only.

The SET command allows modification of the current value of those parameters that can be set locally to a session; it has no effect on other sessions. The corresponding SQL function is set_config(setting_name, new_value, is_local)

Only superusers can modify these anyway, but for a persisting solution these need to be put in the config file.
We have to determine if the config file recognizes custom variables.

We need to update RTE selectedCols when rewriting

If I write the following query SELECT count(name) FROM users which gets rewritten to diffix_count(id, name) I get the following RTE:

{RTE
  :alias <>
  :eref {ALIAS
    :aliasname users
    :colnames (
      "id"
      "name"
    )
  }
  :rtekind 0
  :relid 136662
  :relkind r
  :rellockmode 1
  :tablesample <>
  :lateral false
  :inh true
  :inFromCl true
  :requiredPerms 2
  :checkAsUser 0
  :selectedCols (b 9)
  :insertedCols (b)
  :updatedCols (b)
  :extraUpdatedCols (b)
  :securityQuals <>
}

When I write SELECT diffix_count(id, name) FROM users directly I get this RTE:

{RTE
  :alias <>
  :eref {ALIAS
    :aliasname users
    :colnames (
      "id"
      "name"
    )
  }
  :rtekind 0
  :relid 136662
  :relkind r
  :rellockmode 1
  :tablesample <>
  :lateral false
  :inh true
  :inFromCl true
  :requiredPerms 2
  :checkAsUser 0
  :selectedCols (b 8 9)
  :insertedCols (b)
  :updatedCols (b)
  :extraUpdatedCols (b)
  :securityQuals <>
}

These are the same except for selectedCols: :selectedCols (b 9) vs :selectedCols (b 8 9).
The (b ...) indicates a bit list. We probably need to match this as the planner could use this data to determine what it needs to look up.

Organize project in areas

As the project is growing we get more and more files in a flat structure. I think we need to organize headers and source files logically.

Loose example of what I'm suggesting:

.
├── aggregation
│   ├── aid.h
│   ├── aid_tracker.h
│   ├── contribution_tracker.h
│   └── random.h
├── config.h
├── hooks.h
├── query
│   ├── context.h
│   ├── node_helpers.h
│   ├── oid_cache.h
│   ├── rewrite.h
│   └── validation.h
└── utils.h

Not happy with aggregation but that was the first thing which came to mind.

Thoughts?

Create documentation template with getting started etc.

We do not rewrite EXPLAINed query

If we EXPLAIN SELECT count(*) FROM sensitive_table then this is considered as a non-anon query

Rewrite queries to include HAVING diffix_lcf for buckets

This ensures buckets pass LCF. Aggregate OIDs can be looked up in OidCache found in "pg_diffix/oid_cache.h".

What if we have a global aggregate? This shouldn't be suppressed.
With grouping sets this can get a little complicated.

Config is prone to getting stale

We load config once at startup. This has issues if we remake tables or if tables don't exist yet at time of activation.

Some possible approaches:

Always fetch table OIDs at time of querying... potentially slow?
Detect schema changes automatically... if possible
Periodically expire config