
ganges's Introduction

Hi there 👋! This is Fjoni!

About me 😄

  • computer science student at TU Berlin
  • interested in software engineering and distributed systems

You can contact me 📫

ganges's People

Contributors

fjoniyz · ingastrelnikova · nomorehumor · nomorelinux · overflw · ralfons-06 · sofia-001


ganges's Issues

First approach: Kafka proxy

We need a way to modify data in Kafka before it is written to the topic. One option is to use a broker as a kind of proxy.
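One lightweight way to prototype this without a full proxy broker could be a producer interceptor, which rewrites records just before they are sent. A minimal sketch, assuming string values and a placeholder anonymize() transformation:

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Rewrites every record's value before it reaches the topic.
public class AnonymizingInterceptor implements ProducerInterceptor<String, String> {

    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // anonymize() is a placeholder for whatever transformation we pick.
        return new ProducerRecord<>(record.topic(), record.partition(),
                record.timestamp(), record.key(), anonymize(record.value()));
    }

    private String anonymize(String value) {
        return value; // TODO: plug in the actual anonymization step
    }

    @Override public void onAcknowledgement(RecordMetadata metadata, Exception exception) { }
    @Override public void close() { }
    @Override public void configure(Map<String, ?> configs) { }
}

The interceptor would be registered on the producer via the interceptor.classes config. The drawback compared to a real proxy is that it only covers producers we control.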

CastleGuard only accepts one sensitive attribute

@sofia-001 and I wondered why the CASTLEGUARD implementation only allows one sensitive attribute, which is used to specify the column that will be l-diversified. Since it should be possible to l-diversify more than one column, this seems to be a limitation of the example implementation we used as a base.

We also asked Philip for help with this issue, and he confirmed that we should be able to specify more than one sensitive attribute. I will post his full reply below.
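For reference, extending the check to several sensitive attributes would just mean applying the single-attribute distinct-count test per column. A hedged sketch (the tuple representation and names are ours, not CASTLEGUARD's):

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LDiversity {

    // A cluster is l-diverse if, for every sensitive attribute, its tuples
    // contain at least l distinct values in that column.
    static boolean isLDiverse(List<Map<String, Object>> cluster,
                              List<String> sensitiveAttributes, int l) {
        for (String attribute : sensitiveAttributes) {
            Set<Object> distinct = new HashSet<>();
            for (Map<String, Object> tuple : cluster) {
                distinct.add(tuple.get(attribute));
            }
            if (distinct.size() < l) {
                return false;
            }
        }
        return true;
    }
}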

Find dataset to run through Kafka

We need to find a dataset to run through Kafka.
What this means:

  • Retrieve the data from a file (e.g. CSV) and write each entry in the file to a Kafka topic
  • After the entries are written to the Kafka topic, create a subscriber to this topic and read the data back (see the sketch below)
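A minimal sketch of both steps, assuming a local broker, a file dataset.csv, and a topic input-topic (all placeholders):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CsvRoundTrip {
    public static void main(String[] args) throws Exception {
        // Step 1: write each CSV row into the topic as one record.
        Properties prod = new Properties();
        prod.put("bootstrap.servers", "localhost:9092");
        prod.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prod.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prod)) {
            for (String line : Files.readAllLines(Paths.get("dataset.csv"))) {
                producer.send(new ProducerRecord<>("input-topic", line));
            }
        }

        // Step 2: subscribe to the topic and read the entries back.
        Properties cons = new Properties();
        cons.put("bootstrap.servers", "localhost:9092");
        cons.put("group.id", "csv-reader");
        cons.put("auto.offset.reset", "earliest");
        cons.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cons.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons)) {
            consumer.subscribe(List.of("input-topic"));
            consumer.poll(Duration.ofSeconds(5)).forEach(r -> System.out.println(r.value()));
        }
    }
}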

Feel free to post dataset suggestions here as comments, and afterwards we can choose which one we want to use 😄

CASTLEGUARD - Cluster

Implementing clusters for storing and generalizing data

  • Generalization and addition of newly arriving data
    • based on the data already in the cluster (its "range")
  • Calculation of how much the range grows when a new data point is added
  • Calculation of how much the range grows when two clusters are merged
  • Querying the information content of a cluster -> information loss
    • Research suitable metrics for this purpose

CASTLEGUARD reference: cluster.py, range.py
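As a starting point, a per-attribute range and its enlargement could look like this (a sketch with our own names, not a port of range.py):

// One generalization interval over a numeric quasi-identifier.
class Range {
    private double low, high;

    Range(double low, double high) { this.low = low; this.high = high; }

    double span() { return high - low; }

    // How much the interval would grow if it had to cover `value`.
    double enlargement(double value) {
        return (Math.max(high, value) - Math.min(low, value)) - span();
    }

    // How much the interval would grow if merged with another range.
    double enlargement(Range other) {
        return (Math.max(high, other.high) - Math.min(low, other.low)) - span();
    }

    // Extend the interval to cover a newly added data point.
    void extendTo(double value) {
        low = Math.min(low, value);
        high = Math.max(high, value);
    }
}

A cluster would keep one such range per quasi-identifier; normalizing span() by the attribute's global range gives a simple candidate for the information-loss metric.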

Finish visualization prototype setup

We want to showcase our setup by visualizing our hard-coded example CSV data from Ampeers in our Grafana setup.
The data should be manipulated in our streams application.

Json Parser

Based on the sensitive data point keys and the anonymization algorithms, we need to parse the incoming topic messages and pass them to the algorithm class.
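A hedged sketch with Jackson; the key set, the numeric-only handling, and the anonymize() hook are assumptions:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.util.Set;

public class MessageParser {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Keys whose values must be passed to the anonymization algorithm.
    private final Set<String> sensitiveKeys;

    public MessageParser(Set<String> sensitiveKeys) {
        this.sensitiveKeys = sensitiveKeys;
    }

    // Parse one topic message, anonymize the sensitive fields, re-serialize.
    public String process(String message) throws Exception {
        ObjectNode json = (ObjectNode) MAPPER.readTree(message);
        for (String key : sensitiveKeys) {
            JsonNode value = json.get(key);
            if (value != null && value.isNumber()) {
                // Placeholder for the call into the algorithm class.
                json.put(key, anonymize(value.asDouble()));
            }
        }
        return MAPPER.writeValueAsString(json);
    }

    private double anonymize(double value) { return value; } // TODO
}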

Multithreading

  • Run anonymization in different threads, each processing a different topic
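A minimal sketch of the per-topic threading; the consume loop is a stub, and each thread must own its own KafkaConsumer, since consumers are not thread-safe:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TopicWorkers {
    public static void main(String[] args) {
        List<String> topics = List.of("topic-a", "topic-b"); // placeholder names
        ExecutorService pool = Executors.newFixedThreadPool(topics.size());
        for (String topic : topics) {
            // One worker per topic; each creates its own consumer inside the loop.
            pool.submit(() -> runAnonymizationLoop(topic));
        }
    }

    static void runAnonymizationLoop(String topic) {
        // Stub: poll the topic, anonymize each record, forward it downstream.
    }
}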

Bug: Split-l

I think I found the issue with the Split-L implementation.

Lines 546-550 should represent this part of the pseudocode:

foreach sub-cluster SC(i) in SC do
    foreach tuple t(i) in SC(i) do
        Let G(t(i)) be set of tuples in C, such that G(t(i)) = { t in C | t.pid = t(i).pid}
        Insert G(t(i)) into SC(i)
        Delete G(t(i)) from C

Lines 546-550:

for c in sc:
    for t in c.contents:
        G = [t_h for t_h in C.contents if t_h['pid'] == t['pid']]
        for _ in G:
            c.insert(t)

It seems that instead of adding the tuples from G to the cluster, the tuple from which G is computed is inserted once per element of G. So the snippet should look like this:

for c in sc:
    for t in c.contents:
        G = [t_h for t_h in C.contents if t_h['pid'] == t['pid']]
        for t_h in G:
            c.insert(t_h)

However, in our implementation this seems to be handled correctly:

for (Cluster cluster : sc) {
    for (Item tuple : cluster.getContents()) {
        // Collect all tuples in c that share the pid of the current tuple.
        List<Item> g = new ArrayList<>();
        for (Item t : c.getContents()) {
            if (Objects.equals(t.getData().get("pid"), tuple.getData().get("pid"))) {
                g.add(t);
            }
        }
        // Insert the collected tuples (not the current tuple) into the sub-cluster.
        for (Item t : g) {
            cluster.insert(t);
        }
    }
    bigGamma.add(cluster);
}

Quality Benchmark

Certain analyses should work just as well on the anonymized data as on the raw data.
The problem is illustrated by the following example:
A forecast of the power consumption of an office building is built from the raw data. If anonymization removes the building's size from the data, then factors such as "in large buildings, employees overall work later" can no longer be taken into account.

Goal: forecasts on the anonymized data should be almost as good as on the raw data before.

  • Forecast
  • Information loss

2. Ampeers Presentation (20.06.)

Visualization of what we do:

  • Architecture diagram
  • Dataflow
  • Anonymization

Simple explanation of what the different privacy methods do.

Validation module

We want to validate the privacy guarantees defined by the current algorithm parameters.

We will do this at the level of the KASD visualization tool.
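For k-anonymity, the check at the visualization level could be as simple as counting equivalence classes in the published output; a sketch (the row representation is an assumption):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KAnonymityCheck {

    // k-anonymity holds if every combination of quasi-identifier values
    // in the published output occurs at least k times.
    static boolean satisfiesK(List<Map<String, Object>> published,
                              List<String> quasiIdentifiers, int k) {
        Map<String, Integer> classSizes = new HashMap<>();
        for (Map<String, Object> row : published) {
            StringBuilder key = new StringBuilder();
            for (String qi : quasiIdentifiers) {
                key.append(row.get(qi)).append('|');
            }
            classSizes.merge(key.toString(), 1, Integer::sum);
        }
        return classSizes.values().stream().allMatch(size -> size >= k);
    }
}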

CASTLEGUARD - Basis

Base skeleton for the algorithm:

Definition (Differentially Private k-Anonymity in Data Streams). Algorithm A satisfies differentially private k-anonymity over an arbitrary input data stream D if:

  • β-Sampling: When D yields a tuple, it is immediately suppressed with probability 1 − β
  • Perturbation: Sampled tuples have their QI values perturbed using additive noise and are grouped using generalized clusters over perturbed values
  • k-Suppression: Generalizations are suppressed (not published) if they appear fewer than k times


Reference

Tasks

  • Initialization of the CASTLE algorithm with its respective parameters
  • Base ("controller") for the algorithm (see pseudocode and insert() method in castle.py)
  • Random omission of a data point (β-sampling, see the sketch after this list)
  • Adding a data point to a cluster
  • Output of a cluster + delay constraint
    • In conjunction with the Cluster Operations team → publish clusters as small as possible
  • Interface for perturbation, best selection
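The β-sampling step is the easiest to pin down from the definition above; a minimal sketch (class and parameter names are ours):

import java.util.Optional;
import java.util.Random;

public class BetaSampler {
    private final double beta;
    private final Random random = new Random();

    public BetaSampler(double beta) { this.beta = beta; }

    // A tuple survives with probability beta and is suppressed
    // (probability 1 − beta) otherwise.
    public <T> Optional<T> sample(T tuple) {
        return random.nextDouble() < beta ? Optional.of(tuple) : Optional.empty();
    }
}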

Delta-DOCA - Connection to Kafka Streams

Open questions:

Implementation

It seems we need to collect the previously emitted datapoints and pass them as a 2D array to DOCA:

  • How and at which point should this data be stored?
  • How many datapoints should we hold, and what do we do once we reach the threshold? This sounds like a sliding-window approach (see the sketch after this list).
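If it is a sliding window, a bounded buffer that evicts the oldest datapoint once the threshold is reached might look like this (capacity and eviction policy are exactly the open questions above, so this is purely illustrative):

import java.util.ArrayDeque;
import java.util.Deque;

// Collects emitted datapoints; once `capacity` is reached the oldest
// entry is dropped, so the window slides instead of growing.
public class SlidingWindow {
    private final int capacity;
    private final Deque<double[]> window = new ArrayDeque<>();

    public SlidingWindow(int capacity) { this.capacity = capacity; }

    public void add(double[] datapoint) {
        if (window.size() == capacity) {
            window.removeFirst();
        }
        window.addLast(datapoint);
    }

    // Snapshot of the current window as the 2D array DOCA expects.
    public double[][] snapshot() {
        return window.toArray(new double[0][]);
    }
}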

CASTLEGUARD - Cluster Operations

Implementing essential Cluster Operations

  • Best cluster choice for new items
  • Cluster merging to achieve k-anonymity
  • Split cluster if possible

Reference: best_selection(), split(), split_(), generate_bucket(), merge_cluster()
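For the best-cluster choice, the usual criterion is minimal enlargement of the cluster's generalization. A sketch against the Cluster and Item types from our implementation; the enlargementFor() helper is hypothetical:

import java.util.List;

public class BestSelection {

    // Pick the cluster whose generalization grows least when absorbing
    // the new item; returns null if no cluster exists yet.
    static Cluster bestSelection(List<Cluster> clusters, Item item) {
        Cluster best = null;
        double minEnlargement = Double.POSITIVE_INFINITY;
        for (Cluster cluster : clusters) {
            double e = cluster.enlargementFor(item); // hypothetical helper
            if (e < minEnlargement) {
                minEnlargement = e;
                best = cluster;
            }
        }
        return best;
    }
}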

Extend the Kafka Streams approach

  • Implement the properties file (see the sketch after this list)
  • Evaluate which algorithm to extend the approach with
  • Bindings to Python or to code in other programming languages
  • Containerize
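For the properties file, a sketch of loading an externalized Streams configuration (the file name and values are placeholders):

import java.io.FileInputStream;
import java.util.Properties;

public class StreamsConfigLoader {
    // Load the externalized configuration instead of hard-coding it.
    static Properties load(String path) throws Exception {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);
        }
        return props;
    }
}

The file itself would contain at least the standard Kafka Streams keys, e.g. application.id and bootstrap.servers, plus whatever algorithm parameters we decide to externalize.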

Plan: Library of anonymization algos

How can we easily integrate different anonymization algorithms into our toolkit? Are there existing libraries we could use? Could we build our own from different existing implementations?
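One option might be a single interface that every algorithm (CASTLEGUARD, Delta-DOCA, ...) implements, so the streams pipeline stays agnostic; a sketch with our own names:

import java.util.List;
import java.util.Map;

// Common contract for streaming anonymization algorithms. process() may
// return an empty list, since stream algorithms buffer tuples in clusters
// or windows before publishing them.
public interface AnonymizationAlgorithm {
    List<Map<String, Object>> process(Map<String, Object> tuple);
}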

Benchmarking Plan

What is the computational overhead and time lag of our implementation?

Compare:

  • Vanilla Kafka
  • 0-extension (pass data through unmodified)
  • Apply modification

We want to do this for the Kafka Streams approach and for the interceptor / Kafka Connect approaches.

Furthermore, we want to do this for different cluster sizes.
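For the time-lag measurement, one simple approach would be to stamp each record with its send time in a header and compare on the consumer side; a sketch (the header name is made up):

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LatencyProbe {

    // Producer side: stamp the record with its send time.
    static void stamp(ProducerRecord<String, String> record) {
        record.headers().add("produce-ts",
                Long.toString(System.currentTimeMillis()).getBytes(StandardCharsets.UTF_8));
    }

    // Consumer side: end-to-end latency of one record in milliseconds.
    static long latencyMs(ConsumerRecord<String, String> record) {
        byte[] ts = record.headers().lastHeader("produce-ts").value();
        return System.currentTimeMillis()
                - Long.parseLong(new String(ts, StandardCharsets.UTF_8));
    }
}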

CASTLEGUARD - Perturbation

Adding noise for differential privacy

Perturbation: Sampled tuples have their QI values perturbed using additive noise and are grouped using generalized clusters over perturbed values


  • Perturbation with different distributions - these offer different trade-offs between usability and privacy.
    • Laplace (used in the paper):
      - strong privacy guarantee
      - suitable for aggregation queries
    • Gaussian:
      - can be less disruptive than Laplace for high-dimensional data
      - noise scale must be proportional to the sensitivity of the function divided by the desired privacy loss ε
      - the scale of the Gaussian noise is larger than that of the Laplace noise
      - weaker privacy guarantees
    • Exponential:
      - the exponential mechanism selects outputs with probability proportional to an exponential function of their utility
      - used in differentially private mechanisms where the desired output is a value from a continuous range or a complex dataset, rather than a simple count or sum
      - not universally applicable (unlike Laplace and Gaussian)
    • Geometric:
      - often used in local DP
      - noise is added to individual data points rather than to aggregate results
      - applicable in settings where the data and queries are all integer-valued
      - strong privacy guarantee in that case
    • Poisson:
      - in general worse than Laplace and Gaussian
      - often used for counts (of events)
  • Possibly more detailed research on candidate distributions
  • CASTLEGUARD reference: fudge_tuple() in castle.py (from line 216)
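Since the paper uses Laplace, a sketch of the additive mechanism: java.util has no Laplace sampler, so the draw uses the inverse CDF with scale b = sensitivity / ε (names are ours):

import java.util.Random;

public class LaplaceNoise {
    private final Random random = new Random();

    // Draw from Laplace(0, b) with b = sensitivity / epsilon via inverse CDF.
    double sample(double sensitivity, double epsilon) {
        double b = sensitivity / epsilon;
        double u = random.nextDouble() - 0.5; // uniform on [-0.5, 0.5)
        return -b * Math.signum(u) * Math.log(1 - 2 * Math.abs(u));
    }

    // Perturb one quasi-identifier value with additive noise.
    double perturb(double value, double sensitivity, double epsilon) {
        return value + sample(sensitivity, epsilon);
    }
}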
