Comments (12)
After thinking about it a bit more, I think they should be included in the seed (but they should be ignored during aggregation).
NULL
values were ignored during aggregation in the previous system, are ignored in the reference implementation and are also ignored in standard SQL aggregators.
But the seed is a property of the bucket, so any encountered AIDs have to contribute to it. If we find a way to compute the seed while we digest data, then we can drop NULL
handling from the aggregators, simplifying stuff a bit more.
from pg_diffix.
On the assumption that it's the value contributed by the AID that's null
, and not the AID itself, then I think it should be:
- In
count(*)
then yes. Thenull
value is counted and hence the AID should also be used in the seed material - For aggregates where
null
-values are ignored the AID does not contribute to the seed for such a value (such ascount(price)
for a price ofnull
)
from pg_diffix.
But the seed is a property of the bucket, so any encountered AIDs have to contribute to it.
Why should any encountered AID contribute to the seed, if that AID doesn't otherwise contribute to the bucket?
I think exactly the "The seed is a property of the bucket" statement rings true, but I would read that as an argument for excluding the AID from the seed in the case where it didn't contribute to the aggregate.
from pg_diffix.
Why should any encountered AID contribute to the seed, if that AID doesn't otherwise contribute to the bucket?
diffix_lcf(aid)
has no knowledge of what's happening in aggregates.
Hmm, let's consider this scenario:
There are 100 rows of shape (aid, col)
in a bucket where AIDs are unique in interval [1...100]
.
Let's suppose 99 rows have col
set to NULL
and we want to calculate count(col)
.
Should we suppress this bucket? diffix_lcf(aid)
says no because it has no idea that col
is NULL
...
from pg_diffix.
As much as I would like to use this opportunity to get rid of some code, I still think it should affect the seed.
Even if it has no effect on the aggregate, it still helps the bucket pass LCF.
from pg_diffix.
There are 100 rows of shape (aid, col) in a bucket where AIDs are unique in interval [1...100].
Let's suppose 99 rows have col set to NULL and we want to calculate count(col).
Should we suppress this bucket? diffix_lcf(aid) says no because it has no idea that col is NULL...
It passes the low count filter (and all AIDs contribute to the seed for the low count filter), but we don't produce an aggregate, because we have insufficient data for the aggregate.
We can still output some value such as null, <insufficient data for an aggregate count>
(assuming the query was SELECT col, count(...)
).
from pg_diffix.
The main question is do we need to have the same seed for all aggregators (including LCF) or not.
Previously, the seed was computed separately and per bucket. We don't have this option now, so maybe we don't have to keep the same design, unless it causes a vulnerability.
from pg_diffix.
There are a number of different things being discussed here, so I'm a bit confused. The set of questions seem to be:
- Should a NULL AID value contribute to the seed?
- Should a NULL AID value be counted as a distinct user when counting the number of distinct users?
- Do we need to have the same seed for all aggregators (including LCF) or not?
Regarding 1, what would cause an AID to be NULL?
Regarding 2, can we avoid this question by always knowing what the actual AID is?
Regarding 3, this question doesn't arise for Publish AFAIK, and I think it is premature to ask it for the other variants.
In fact, none of this really matters for Publish...
from pg_diffix.
Should a NULL AID value contribute to the seed?
The question is not for NULL AID but for NULL contribution coming from a (non-null) AID.
In the example below, which AIDs will be used for the AID noise seed for count(col)
: 1,2,4 or 1,2,3,4?
aid | col |
---|---|
1 | 'a' |
2 | 'b' |
3 | NULL |
4 | 'c' |
from pg_diffix.
Ah, I thought you meant the AID itself was NULL, not the aggregate.
My intuition is that the AIDs with NULL contribution to the aggregator should be included in the seed as well as the LCF computation.
And I can't think of an attack that would exploit this.
And I presume it is simpler to just include the AID in all cases (no special cases to deal with NULL).
So let's go with including all AIDs regardless of contribution to the aggregator.
from pg_diffix.
And I presume it is simpler to just include the AID in all cases (no special cases to deal with NULL).
Actually, it is easier to exclude NULL contributions (simpler to ignore stuff sooner, rather than later).
But code is already written to include them in the extension, so maybe simpler to leave it so? Then we need to update the reference implementation to do the same.
from pg_diffix.
Closing this as NULL contributions are already included in the seed.
from pg_diffix.
Related Issues (20)
- `treat_unmarked_tables_as_public` to work with views also HOT 1
- Fix incompatibility with OmniDB-ng
- Fix incompatibility with Sidu HOT 1
- Fix incompatibility with Adminer HOT 4
- Casts are not documented in analyst guide
- In accuracy in count(distinct) HOT 4
- Use offsetof for flexible struct allocation HOT 2
- `IN (SELECT ...)` bypasses `make_bucket_scan`
- dummy_agg_noise_transfn does not need to hold stype=AnonAggState
- Store aid trackers inline in LowCountState HOT 1
- Reject queries directly selecting AID columns. HOT 4
- Support parameterized queries.
- Reject unsupported AID column types during labeling.
- Add support for more AID column types.
- `count_histogram` tests on my system produce memory corruption warnings. HOT 5
- Make sorting of the *-bucket consistent and controllable HOT 3
- `diffix.avg/sum_noise` for time-related columns crashes pg_diffix
- `WHERE date_trunc...` crashes
- Enable more datetime generalization functions
- Ensure and test if it works on PostgreSQL 15
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pg_diffix.