
ganges's Introduction

Hi there 👋! This is Fjoni!

About me 😄

  • computer science student at TU Berlin
  • interested in software engineering and distributed systems

You can contact me 📫

ganges's People

Contributors

fjoniyz · ingastrelnikova · nomorehumor · nomorelinux · overflw · ralfons-06 · sofia-001


ganges's Issues

First approach: Kafka proxy

We need a way to modify data in Kafka before it is written to the topic. One option is to use a broker as a kind of proxy.
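One lightweight way to prototype this without a full proxy broker could be a producer interceptor, which rewrites records just before they are sent. A minimal sketch, assuming string values and a placeholder anonymize() transformation:

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Rewrites every record's value before it reaches the topic.
public class AnonymizingInterceptor implements ProducerInterceptor<String, String> {

    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // anonymize() is a placeholder for whatever transformation we pick.
        return new ProducerRecord<>(record.topic(), record.partition(),
                record.timestamp(), record.key(), anonymize(record.value()));
    }

    private String anonymize(String value) {
        return value; // TODO: plug in the actual anonymization step
    }

    @Override public void onAcknowledgement(RecordMetadata metadata, Exception exception) { }
    @Override public void close() { }
    @Override public void configure(Map<String, ?> configs) { }
}

The interceptor would be registered on the producer via the interceptor.classes config. The drawback compared to a real proxy is that it only covers producers we control.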

CastleGuard only accepts one sensitive attribute

@sofia-001 and I wondered why the CASTLEGUARD implementation only allows one sensitive attribute, which is used to specify the column that will be l-diversified. Since it should be possible to l-diversify more than one column, this seems to be a limitation of the example implementation we used as a base.

We also asked Philip for help with this issue, and he confirmed that we should be able to specify more than one sensitive attribute. I will post his full reply below.
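For reference, extending the check to several sensitive attributes would just mean applying the single-attribute distinct-count test per column. A hedged sketch (the tuple representation and names are ours, not CASTLEGUARD's):

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LDiversity {

    // A cluster is l-diverse if, for every sensitive attribute, its tuples
    // contain at least l distinct values in that column.
    static boolean isLDiverse(List<Map<String, Object>> cluster,
                              List<String> sensitiveAttributes, int l) {
        for (String attribute : sensitiveAttributes) {
            Set<Object> distinct = new HashSet<>();
            for (Map<String, Object> tuple : cluster) {
                distinct.add(tuple.get(attribute));
            }
            if (distinct.size() < l) {
                return false;
            }
        }
        return true;
    }
}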

Find dataset to run through Kafka

We need to find a dataset to run through Kafka.
What this means:

  • Retrieve the data from a file (e.g. CSV) and write each entry in the file to a Kafka topic
  • After the entries are written to the Kafka topic, create a subscriber to this topic and read the data back (see the sketch below)
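A minimal sketch of both steps, assuming a local broker, a file dataset.csv, and a topic input-topic (all placeholders):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CsvRoundTrip {
    public static void main(String[] args) throws Exception {
        // Step 1: write each CSV row into the topic as one record.
        Properties prod = new Properties();
        prod.put("bootstrap.servers", "localhost:9092");
        prod.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prod.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prod)) {
            for (String line : Files.readAllLines(Paths.get("dataset.csv"))) {
                producer.send(new ProducerRecord<>("input-topic", line));
            }
        }

        // Step 2: subscribe to the topic and read the entries back.
        Properties cons = new Properties();
        cons.put("bootstrap.servers", "localhost:9092");
        cons.put("group.id", "csv-reader");
        cons.put("auto.offset.reset", "earliest");
        cons.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cons.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons)) {
            consumer.subscribe(List.of("input-topic"));
            consumer.poll(Duration.ofSeconds(5)).forEach(r -> System.out.println(r.value()));
        }
    }
}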

Feel free to post dataset suggestions here as comments, and afterwards we can choose which one we want to use 😄

CASTLEGUARD - Cluster

Implementing clusters for storing and generalizing data

  • Generalization and addition of newly arriving data
    • based on the data already in the cluster (its "range")
  • Calculation of how much the range grows when a new data point is added
  • Calculation of how much the range grows when two clusters are merged
  • Querying the information content of a cluster -> information loss
    • Research suitable metrics for this purpose

CASTLEGUARD reference: cluster.py, range.py
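As a starting point, a per-attribute range and its enlargement could look like this (a sketch with our own names, not a port of range.py):

// One generalization interval over a numeric quasi-identifier.
class Range {
    private double low, high;

    Range(double low, double high) { this.low = low; this.high = high; }

    double span() { return high - low; }

    // How much the interval would grow if it had to cover `value`.
    double enlargement(double value) {
        return (Math.max(high, value) - Math.min(low, value)) - span();
    }

    // How much the interval would grow if merged with another range.
    double enlargement(Range other) {
        return (Math.max(high, other.high) - Math.min(low, other.low)) - span();
    }

    // Extend the interval to cover a newly added data point.
    void extendTo(double value) {
        low = Math.min(low, value);
        high = Math.max(high, value);
    }
}

A cluster would keep one such range per quasi-identifier; normalizing span() by the attribute's global range gives a simple candidate for the information-loss metric.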

Finish visualization prototype setup

We want to showcase our setup by visualizing our hard-coded example CSV data from Ampeers in our Grafana setup.
The data should be manipulated in our streams application.

Json Parser

Based on the sensitive data point keys and the anonymization algorithms, we need to parse the incoming topic messages and pass them to the algorithm class.
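A hedged sketch with Jackson; the key set, the numeric-only handling, and the anonymize() hook are assumptions:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.util.Set;

public class MessageParser {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Keys whose values must be passed to the anonymization algorithm.
    private final Set<String> sensitiveKeys;

    public MessageParser(Set<String> sensitiveKeys) {
        this.sensitiveKeys = sensitiveKeys;
    }

    // Parse one topic message, anonymize the sensitive fields, re-serialize.
    public String process(String message) throws Exception {
        ObjectNode json = (ObjectNode) MAPPER.readTree(message);
        for (String key : sensitiveKeys) {
            JsonNode value = json.get(key);
            if (value != null && value.isNumber()) {
                // Placeholder for the call into the algorithm class.
                json.put(key, anonymize(value.asDouble()));
            }
        }
        return MAPPER.writeValueAsString(json);
    }

    private double anonymize(double value) { return value; } // TODO
}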

Multithreading

  • Run anonymization in different threads, each processing a different topic
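A minimal sketch of the per-topic threading; the consume loop is a stub, and each thread must own its own KafkaConsumer, since consumers are not thread-safe:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TopicWorkers {
    public static void main(String[] args) {
        List<String> topics = List.of("topic-a", "topic-b"); // placeholder names
        ExecutorService pool = Executors.newFixedThreadPool(topics.size());
        for (String topic : topics) {
            // One worker per topic; each creates its own consumer inside the loop.
            pool.submit(() -> runAnonymizationLoop(topic));
        }
    }

    static void runAnonymizationLoop(String topic) {
        // Stub: poll the topic, anonymize each record, forward it downstream.
    }
}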

Bug: Split-l

I think I found the issue with the Split-L implementation.

Lines 546-550 should represent this part of the pseudocode:

foreach sub-cluster SC(i) in SC do
    foreach tuple t(i) in SC(i) do
        Let G(t(i)) be set of tuples in C, such that G(t(i)) = { t in C | t.pid = t(i).pid}
        Insert G(t(i)) into SC(i)
        Delete G(t(i)) from C

Lines 546-550:

for c in sc:
    for t in c.contents:
        G = [t_h for t_h in C.contents if t_h['pid'] == t['pid']]
        for _ in G:
            c.insert(t)

It seems that instead of adding the tuples from G to the cluster, the tuple from which G is computed is inserted once per element of G. So the snippet should look like this:

for c in sc:
    for t in c.contents:
        G = [t_h for t_h in C.contents if t_h['pid'] == t['pid']]
        for t_h in G:
            c.insert(t_h)

However, in our implementation this seems to be handled correctly:

for (Cluster cluster : sc) {
    for (Item tuple : cluster.getContents()) {
        // Collect all tuples in c that share the pid of the current tuple.
        List<Item> g = new ArrayList<>();
        for (Item t : c.getContents()) {
            if (Objects.equals(t.getData().get("pid"), tuple.getData().get("pid"))) {
                g.add(t);
            }
        }
        // Insert the collected tuples (not the current tuple) into the sub-cluster.
        for (Item t : g) {
            cluster.insert(t);
        }
    }
    bigGamma.add(cluster);
}

Quality Benchmark

Certain analyses should work just as well on the anonymized data as on the raw data.
The problem is illustrated by the following example:
A forecast of the power consumption of an office building is built from the raw data. If anonymization removes the building's size from the data, then factors such as "in large buildings, employees overall work later" can no longer be taken into account.

Goal: forecasts on the anonymized data should be almost as good as on the raw data before.

  • Forecast
  • Information loss

2. Ampeers Presentation (20.06.)

Visualization of what we do:

  • Architecture diagram
  • Dataflow
  • Anonymization

Simple explanation of what the different privacy methods do.

Validation module

We want to validate the privacy guarantees defined by the current algorithm parameters.

We will do this at the level of the KASD visualization tool.
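For k-anonymity, the check at the visualization level could be as simple as counting equivalence classes in the published output; a sketch (the row representation is an assumption):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KAnonymityCheck {

    // k-anonymity holds if every combination of quasi-identifier values
    // in the published output occurs at least k times.
    static boolean satisfiesK(List<Map<String, Object>> published,
                              List<String> quasiIdentifiers, int k) {
        Map<String, Integer> classSizes = new HashMap<>();
        for (Map<String, Object> row : published) {
            StringBuilder key = new StringBuilder();
            for (String qi : quasiIdentifiers) {
                key.append(row.get(qi)).append('|');
            }
            classSizes.merge(key.toString(), 1, Integer::sum);
        }
        return classSizes.values().stream().allMatch(size -> size >= k);
    }
}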

CASTLEGUARD - Basis

Base skeleton for the algorithm:

Definition (Differentially Private k-Anonymity in Data Streams). Algorithm A satisfies differentially private k-anonymity over an arbitrary input data stream D if:

  • β-Sampling: When D yields a tuple, it is immediately suppressed with probability 1 − β
  • Perturbation: Sampled tuples have their QI values perturbed using additive noise and are grouped using generalized clusters over perturbed values
  • k-Suppression: Generalizations are suppressed (not published) if they appear fewer than k times


Reference

Tasks

  • Initialization of the CASTLE algorithm with its respective parameters
  • Base ("controller") for the algorithm (see pseudocode and insert() method in castle.py)
  • Random omission of a data point (β-sampling, see the sketch after this list)
  • Adding a data point to a cluster
  • Output of a cluster + delay constraint
    • In conjunction with the Cluster Operations team → publish clusters as small as possible
  • Interface for perturbation, best selection
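The β-sampling step is the easiest to pin down from the definition above; a minimal sketch (class and parameter names are ours):

import java.util.Optional;
import java.util.Random;

public class BetaSampler {
    private final double beta;
    private final Random random = new Random();

    public BetaSampler(double beta) { this.beta = beta; }

    // A tuple survives with probability beta and is suppressed
    // (probability 1 − beta) otherwise.
    public <T> Optional<T> sample(T tuple) {
        return random.nextDouble() < beta ? Optional.of(tuple) : Optional.empty();
    }
}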

Delta-DOCA - Connection to Kafka Streams

Open questions:

Implementation

It seems we need to collect the previously emitted datapoints and pass them as a 2D array to DOCA:

  • How and at which point should this data be stored?
  • How many datapoints should we hold, and what do we do once we reach the threshold? This sounds like a sliding-window approach (see the sketch after this list).
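If it is a sliding window, a bounded buffer that evicts the oldest datapoint once the threshold is reached might look like this (capacity and eviction policy are exactly the open questions above, so this is purely illustrative):

import java.util.ArrayDeque;
import java.util.Deque;

// Collects emitted datapoints; once `capacity` is reached the oldest
// entry is dropped, so the window slides instead of growing.
public class SlidingWindow {
    private final int capacity;
    private final Deque<double[]> window = new ArrayDeque<>();

    public SlidingWindow(int capacity) { this.capacity = capacity; }

    public void add(double[] datapoint) {
        if (window.size() == capacity) {
            window.removeFirst();
        }
        window.addLast(datapoint);
    }

    // Snapshot of the current window as the 2D array DOCA expects.
    public double[][] snapshot() {
        return window.toArray(new double[0][]);
    }
}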

CASTLEGUARD - Cluster Operations

Implementing essential Cluster Operations

  • Best cluster choice for new items
  • Cluster merging to achieve k-anonymity
  • Split cluster if possible

Reference: best_selection(), split(), split_(), generate_bucket(), merge_cluster()
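For the best-cluster choice, the usual criterion is minimal enlargement of the cluster's generalization. A sketch against the Cluster and Item types from our implementation; the enlargementFor() helper is hypothetical:

import java.util.List;

public class BestSelection {

    // Pick the cluster whose generalization grows least when absorbing
    // the new item; returns null if no cluster exists yet.
    static Cluster bestSelection(List<Cluster> clusters, Item item) {
        Cluster best = null;
        double minEnlargement = Double.POSITIVE_INFINITY;
        for (Cluster cluster : clusters) {
            double e = cluster.enlargementFor(item); // hypothetical helper
            if (e < minEnlargement) {
                minEnlargement = e;
                best = cluster;
            }
        }
        return best;
    }
}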

Extend the Kafka Streams approach

  • Implement the properties file (see the sketch after this list)
  • Evaluate which algorithm to extend the approach with
  • Bindings to Python or to code in other programming languages
  • Containerize
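For the properties file, a sketch of loading an externalized Streams configuration (the file name and values are placeholders):

import java.io.FileInputStream;
import java.util.Properties;

public class StreamsConfigLoader {
    // Load the externalized configuration instead of hard-coding it.
    static Properties load(String path) throws Exception {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);
        }
        return props;
    }
}

The file itself would contain at least the standard Kafka Streams keys, e.g. application.id and bootstrap.servers, plus whatever algorithm parameters we decide to externalize.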

Plan: Library of anonymization algos

How can we easily integrate different anonymization algorithms into our toolkit? Are there existing libraries we could use? Could we build our own from different existing implementations?
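One option might be a single interface that every algorithm (CASTLEGUARD, Delta-DOCA, ...) implements, so the streams pipeline stays agnostic; a sketch with our own names:

import java.util.List;
import java.util.Map;

// Common contract for streaming anonymization algorithms. process() may
// return an empty list, since stream algorithms buffer tuples in clusters
// or windows before publishing them.
public interface AnonymizationAlgorithm {
    List<Map<String, Object>> process(Map<String, Object> tuple);
}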

Benchmarking Plan

What is the computational overhead and time lag of our implementation?

Compare:

  • Vanilla Kafka
  • 0-extension (pass data through unmodified)
  • Apply modification

We want to do this for the Kafka Streams approach and for the interceptor / Kafka Connect approaches.

Furthermore, we want to do this for different cluster sizes.
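For the time-lag measurement, one simple approach would be to stamp each record with its send time in a header and compare on the consumer side; a sketch (the header name is made up):

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LatencyProbe {

    // Producer side: stamp the record with its send time.
    static void stamp(ProducerRecord<String, String> record) {
        record.headers().add("produce-ts",
                Long.toString(System.currentTimeMillis()).getBytes(StandardCharsets.UTF_8));
    }

    // Consumer side: end-to-end latency of one record in milliseconds.
    static long latencyMs(ConsumerRecord<String, String> record) {
        byte[] ts = record.headers().lastHeader("produce-ts").value();
        return System.currentTimeMillis()
                - Long.parseLong(new String(ts, StandardCharsets.UTF_8));
    }
}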

CASTLEGUARD - Perturbation

Adding noise for differential privacy

Perturbation: Sampled tuples have their QI values perturbed using additive noise and are grouped using generalized clusters over perturbed values


  • Perturbation with different distributions - these offer different trade-offs between usability and privacy.
    • Laplace (used in the paper):
      - strong privacy guarantee
      - suitable for aggregation queries
    • Gaussian:
      - can be less disruptive than Laplace for high-dimensional data
      - noise scale must be proportional to the sensitivity of the function divided by the desired privacy loss ε
      - the scale of the Gaussian noise is larger than that of the Laplace noise
      - weaker privacy guarantees
    • Exponential:
      - the exponential mechanism selects outputs with probability proportional to an exponential function of their utility
      - used in differentially private mechanisms where the desired output is a value from a continuous range or a complex dataset, rather than a simple count or sum
      - not universally applicable (unlike Laplace and Gaussian)
    • Geometric:
      - often used in local DP
      - noise is added to individual data points rather than to aggregate results
      - applicable in settings where the data and queries are all integer-valued
      - strong privacy guarantee in that case
    • Poisson:
      - in general worse than Laplace and Gaussian
      - often used for counts (of events)
  • Possibly more detailed research on candidate distributions
  • CASTLEGUARD reference: fudge_tuple() in castle.py (from line 216)
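Since the paper uses Laplace, a sketch of the additive mechanism: java.util has no Laplace sampler, so the draw uses the inverse CDF with scale b = sensitivity / ε (names are ours):

import java.util.Random;

public class LaplaceNoise {
    private final Random random = new Random();

    // Draw from Laplace(0, b) with b = sensitivity / epsilon via inverse CDF.
    double sample(double sensitivity, double epsilon) {
        double b = sensitivity / epsilon;
        double u = random.nextDouble() - 0.5; // uniform on [-0.5, 0.5)
        return -b * Math.signum(u) * Math.log(1 - 2 * Math.abs(u));
    }

    // Perturb one quasi-identifier value with additive noise.
    double perturb(double value, double sensitivity, double epsilon) {
        return value + sample(sensitivity, epsilon);
    }
}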
