Giter VIP home page Giter VIP logo

Comments (12)

cristianberneanu avatar cristianberneanu commented on July 28, 2024

After thinking about it a bit more, I think they should be included in the seed (but they should be ignored during aggregation).

NULL values were ignored during aggregation in the previous system, are ignored in the reference implementation and are also ignored in standard SQL aggregators.

But the seed is a property of the bucket, so any encountered AIDs have to contribute to it. If we find a way to compute the seed while we digest data, then we can drop NULL handling from the aggregators, simplifying stuff a bit more.

from pg_diffix.

sebastian avatar sebastian commented on July 28, 2024

On the assumption that it's the value contributed by the AID that's null, and not the AID itself, then I think it should be:

  • In count(*) then yes. The null value is counted and hence the AID should also be used in the seed material
  • For aggregates where null-values are ignored the AID does not contribute to the seed for such a value (such as count(price) for a price of null)

from pg_diffix.

sebastian avatar sebastian commented on July 28, 2024

But the seed is a property of the bucket, so any encountered AIDs have to contribute to it.

Why should any encountered AID contribute to the seed, if that AID doesn't otherwise contribute to the bucket?
I think exactly the "The seed is a property of the bucket" statement rings true, but I would read that as an argument for excluding the AID from the seed in the case where it didn't contribute to the aggregate.

from pg_diffix.

edongashi avatar edongashi commented on July 28, 2024

Why should any encountered AID contribute to the seed, if that AID doesn't otherwise contribute to the bucket?

diffix_lcf(aid) has no knowledge of what's happening in aggregates.

Hmm, let's consider this scenario:

There are 100 rows of shape (aid, col) in a bucket where AIDs are unique in interval [1...100].
Let's suppose 99 rows have col set to NULL and we want to calculate count(col).
Should we suppress this bucket? diffix_lcf(aid) says no because it has no idea that col is NULL...

from pg_diffix.

cristianberneanu avatar cristianberneanu commented on July 28, 2024

As much as I would like to use this opportunity to get rid of some code, I still think it should affect the seed.
Even if it has no effect on the aggregate, it still helps the bucket pass LCF.

from pg_diffix.

sebastian avatar sebastian commented on July 28, 2024

There are 100 rows of shape (aid, col) in a bucket where AIDs are unique in interval [1...100].
Let's suppose 99 rows have col set to NULL and we want to calculate count(col).
Should we suppress this bucket? diffix_lcf(aid) says no because it has no idea that col is NULL...

It passes the low count filter (and all AIDs contribute to the seed for the low count filter), but we don't produce an aggregate, because we have insufficient data for the aggregate.
We can still output some value such as null, <insufficient data for an aggregate count> (assuming the query was SELECT col, count(...)).

from pg_diffix.

cristianberneanu avatar cristianberneanu commented on July 28, 2024

The main question is do we need to have the same seed for all aggregators (including LCF) or not.
Previously, the seed was computed separately and per bucket. We don't have this option now, so maybe we don't have to keep the same design, unless it causes a vulnerability.

from pg_diffix.

yoid2000 avatar yoid2000 commented on July 28, 2024

There are a number of different things being discussed here, so I'm a bit confused. The set of questions seem to be:

  1. Should a NULL AID value contribute to the seed?
  2. Should a NULL AID value be counted as a distinct user when counting the number of distinct users?
  3. Do we need to have the same seed for all aggregators (including LCF) or not?

Regarding 1, what would cause an AID to be NULL?

Regarding 2, can we avoid this question by always knowing what the actual AID is?

Regarding 3, this question doesn't arise for Publish AFAIK, and I think it is premature to ask it for the other variants.

In fact, none of this really matters for Publish...

from pg_diffix.

edongashi avatar edongashi commented on July 28, 2024

Should a NULL AID value contribute to the seed?

The question is not for NULL AID but for NULL contribution coming from a (non-null) AID.

In the example below, which AIDs will be used for the AID noise seed for count(col): 1,2,4 or 1,2,3,4?

aid col
1 'a'
2 'b'
3 NULL
4 'c'

from pg_diffix.

yoid2000 avatar yoid2000 commented on July 28, 2024

Ah, I thought you meant the AID itself was NULL, not the aggregate.

My intuition is that the AIDs with NULL contribution to the aggregator should be included in the seed as well as the LCF computation.

And I can't think of an attack that would exploit this.

And I presume it is simpler to just include the AID in all cases (no special cases to deal with NULL).

So let's go with including all AIDs regardless of contribution to the aggregator.

from pg_diffix.

cristianberneanu avatar cristianberneanu commented on July 28, 2024

And I presume it is simpler to just include the AID in all cases (no special cases to deal with NULL).

Actually, it is easier to exclude NULL contributions (simpler to ignore stuff sooner, rather than later).
But code is already written to include them in the extension, so maybe simpler to leave it so? Then we need to update the reference implementation to do the same.

from pg_diffix.

cristianberneanu avatar cristianberneanu commented on July 28, 2024

Closing this as NULL contributions are already included in the seed.

from pg_diffix.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.