diffix / desktop
Cross-platform desktop application for data anonymization with Open Diffix Elm.
Home Page: https://www.open-diffix.org
License: Other
During CSV import we should either get really good at auto-detecting the delimiter and guessing the column types, or we should allow the user to specify them. Maybe we should guess first, and then allow the user to fine-tune?
Not sure if this is a GUI or CLI issue.
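A delimiter guess could be as simple as the heuristic below: a sketch, not tied to our actual importer, that picks the candidate delimiter appearing a consistent, non-zero number of times on the first few lines.

```typescript
// Guess the CSV delimiter by scoring each candidate on the first few lines:
// a delimiter that appears the same non-zero number of times per line wins.
function guessDelimiter(sample: string, candidates = [",", ";", "\t", "|"]): string {
  const lines = sample
    .split(/\r?\n/)
    .filter((l) => l.length > 0)
    .slice(0, 10);
  let best = candidates[0];
  let bestScore = -1;
  for (const d of candidates) {
    const counts = lines.map((l) => l.split(d).length - 1);
    const min = Math.min(...counts);
    const max = Math.max(...counts);
    // Consistent (min === max) and frequent delimiters score highest.
    const score = min > 0 && min === max ? min : 0;
    if (score > bestScore) {
      bestScore = score;
      best = d;
    }
  }
  return best;
}
```

Type guessing could work the same way: try to parse each column's sample values as numbers or dates, and fall back to text. Whatever we guess, surfacing the result in the UI for correction seems safest.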
It would be good to experiment with anonymizing large CSVs both in our pure .NET code base as well as in the transpiled one in Electron. Are things slow because of the data volume/size (irrespective of JS or .NET) or because of running the anonymization in JS?
You need to implement DiffixAnonymizer in src/state/anonymizer.ts.
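While the real implementation is pending, a stub could be used to wire up the UI. Only the DiffixAnonymizer name and the src/state/anonymizer.ts location come from the issue; the method name, result shape, and stub class are assumptions.

```typescript
// Hypothetical shapes -- the real interface belongs in src/state/anonymizer.ts.
interface QueryResult {
  columns: string[];
  rows: Array<Array<string | number>>;
}

interface DiffixAnonymizer {
  anonymize(filename: string, query: string): Promise<QueryResult>;
}

// A stub implementation, useful for developing the frontend before the backend exists.
class StubAnonymizer implements DiffixAnonymizer {
  async anonymize(_filename: string, _query: string): Promise<QueryResult> {
    return { columns: ["count"], rows: [[0]] };
  }
}
```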
In the file census_small.csv, the age column is being interpreted as text (at least it seems so, because the column sorts alphabetically instead of numerically).
Sorting results by a column doesn't do anything.
Browse the examples and pick something which matches with our use case https://ant.design/components/layout/#components-layout-demo-side
It should be possible to export the results as a CSV so you can take the anonymized results out of the anonymizer and use them in your reports or whatnot.
The result produced should show:
I wonder if it shouldn't also be possible to hide this view, so you only see the anonymized results?
By default suppressed rows and unanonymized aggregates should not make it into the exports!(?)
One option would be to show the distortion information (actual or magnitude) per-bucket.
Another option would be to only show aggregate information, like:
I think the latter is easier to digest.
For .deb artifacts we need to add an executableName prop in package.json: config.forge.packagerConfig.executableName.
For .rpm artifacts we need a license field in package.json.
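A sketch of the relevant package.json fragment; the executable name and license value shown here are placeholders, not decisions:

```json
{
  "license": "SEE LICENSE IN LICENSE.md",
  "config": {
    "forge": {
      "packagerConfig": {
        "executableName": "easy-diffix"
      }
    }
  }
}
```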
https://github.com/tjanczuk/edge
Is it maintained at all? Last update in 2018...
@fjab Do we have a logo for Open Diffix?
If you have a particular naming scheme in mind please speak up. It will be hard to migrate once we write a bunch of components.
If there are no objections we can go with BEM.
To reproduce:
=> results in white screen / crash of app
macOS 11.5.1, easy_diffix 0.1.2
The platform specific binaries need to be bundled with the Electron app. I don't yet have a good sense of how to do this.
We have some basic typing for a schema. Use that to pick which columns we want rendered.
Chromely is an alternative to Electron.NET, which seems less popular, but with more active maintenance (0/42 vs 10/120 PRs).
If we compute the raw and anonymized data separately, we will need to first do two passes through the dataset, and then an additional pass through the results to combine the two sets of buckets. This might be pretty slow for larger inputs.
Alternatively, we could create a custom SQL statement for the reference tool that computes everything we need in a single pass:
SELECT
  column1,
  column2,
  count(*) AS real_count,
  diffix_count(aid) AS anon_count,
  diffix_lcf(aid) AS suppressed
FROM table
GROUP BY 1, 2
Can we hook the CSV reader so that it emits rows to another parallel thread and stores to an sqlite DB (or other fast to read format)? This could happen in the background while a query is doing a full scan.
The file can be hidden in some directory and we can use the hash (which we already have) as an identifier.
Not sure how much we would gain by this. Maybe benchmark a CSV scan against an SQLite scan?
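Since we already hash the file, deriving the cache location is straightforward. A sketch, assuming a SHA-256 content hash and a per-user temp directory (both choices are placeholders):

```typescript
import { createHash } from "node:crypto";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Derive a stable, hidden cache path for the SQLite mirror of a CSV file,
// keyed by the content hash we already compute on import.
function cachePathFor(csvContents: string | Buffer): string {
  const hash = createHash("sha256").update(csvContents).digest("hex");
  return join(tmpdir(), "easy-diffix-cache", `${hash}.sqlite`);
}
```

Because the key is the content hash, re-importing an unchanged file hits the same cache entry, and edited files naturally get a fresh one.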
We should allow the analyst to bucketize or generalize columns.
This does rely on having the column types specified, so that we know how to generalize. For example, turning numbers into ranges, or redacting parts of a string.
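The two generalizations mentioned above could look roughly like this (function names and label formats are illustrative, not a spec):

```typescript
// Bucketize a number into a [lower, lower + width) range label,
// e.g. 37 with width 10 becomes "30-40".
function bucketize(value: number, width: number): string {
  const lower = Math.floor(value / width) * width;
  return `${lower}-${lower + width}`;
}

// Redact the tail of a string, keeping a fixed-length prefix,
// e.g. "12055" keeping 3 characters becomes "120**".
function redact(value: string, keep: number): string {
  return value.slice(0, keep) + "*".repeat(Math.max(0, value.length - keep));
}
```

With typed columns, the UI could offer bucket width for numeric columns and prefix length for text columns.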
Is it still being maintained?
Does IPC work?
We need to agree on the way the Frontend communicates with the Backend.
Since transpiling the reference code to JS resulted in poor performance, the anonymization code will stay in .NET.
Furthermore, I don't think it is a good idea to manually build the query AST in JS land. It couples the Frontend and Backend internals too much. Sending a SQL statement feels cleaner.
As input we send: filename, query statement, anonymization settings.
As output we get: query result or an error.
We pass the input as command-line arguments, and we get back the query result (as either CSV or JSON) on the stdout stream, or an error on the stderr stream.
PROs:
CONs:
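On the frontend side, Option 1 could look roughly like this. The binary name, flag names, and output shape are all assumptions, just to make the data flow concrete:

```typescript
import { spawn } from "node:child_process";

// Build the argument list for the hypothetical anonymizer CLI.
// Flag names here are placeholders, not an agreed interface.
function buildArgs(filename: string, query: string, salt: string): string[] {
  return [filename, "--query", query, "--salt", salt, "--output", "json"];
}

// Spawn the backend process; resolve with parsed stdout on success,
// reject with the stderr contents on a non-zero exit code.
function runQuery(binary: string, args: string[]): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const child = spawn(binary, args);
    let out = "";
    let err = "";
    child.stdout.on("data", (d) => (out += d));
    child.stderr.on("data", (d) => (err += d));
    child.on("close", (code) =>
      code === 0 ? resolve(JSON.parse(out)) : reject(new Error(err))
    );
  });
}
```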
We will need an additional .NET project in this repository that loads the core reference library and dispatches anonymization requests to it. We pass the input as a JSON object and we get back a JSON object with the result or error. We need to decide if we use a socket or the process stdio streams for message exchange.
PROs:
CONs:
I am slightly in favor of Option 1 (I don't consider the drawbacks for it too big).
Re-running the anonymization tends to produce a new and different set of values.
It seems the anonymizer is not stable, despite the seed being set and fixed.
It should be possible to sort the data exported by a column (ascending or descending).
This makes it easier to consume and inspect the results of the tool.
The build-x versions are slow. I want to be able to run the fast, base version for development purposes. Since we added the trimming flag to the fsproj, it no longer works.
I.e. show the difference in counts between the anon count and the real count. And see the rows that were anonymized away.
I suggest making it possible to toggle whether you just want the anonymized view, or the view that shows the diff.
Pre-requisites:
To reproduce:
The app will issue two anonymization requests that are run one after the other.
As soon as the first one returns a result, the frontend will consider the result as finished, and will stop showing the "processing animation". This makes it appear to the analyst as if the second column that was added for anonymization was dropped/ignored. A while later, once the second anonymization ends, the result will be updated and everything will be as expected.
It would be ideal if we could terminate other ongoing and queued anonymization requests when making a new one. That way the time to useful results is kept smaller. This would also solve this bug as only the desired anonymization request would ever be returned.
Alternatively, we could send a request ID with each anonymization request, and only mark a request as complete once the result for the desired request has been returned. This might be generally useful, as we otherwise risk race conditions...
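The request-ID idea could be as small as this sketch: the frontend stamps each request with an increasing ID and drops any result whose ID is not the latest one issued (names are illustrative).

```typescript
// Only the most recently issued request's result is accepted;
// results arriving for older (stale) requests are dropped.
class LatestRequestGate {
  private latestId = 0;

  // Stamp a new outgoing request and invalidate all earlier ones.
  next(): number {
    return ++this.latestId;
  }

  // Check whether an incoming result belongs to the latest request.
  accept(id: number): boolean {
    return id === this.latestId;
  }
}
```

Terminating the stale backend process is still worthwhile for latency, but this gate alone already fixes the "first result wins" bug.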
Should we add strict: true to our tsconfig?
More information about it can be found here.
I am in favor of going as strict as possible now as we are starting out.
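A sketch of the tsconfig fragment this would amount to; the extra flags beyond strict are suggestions, not decisions:

```json
{
  "compilerOptions": {
    "strict": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "noFallthroughCasesInSwitch": true
  }
}
```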
When cloning, you need to init the submodule. Submodules are a pain to work with...
I used this electron-snowpack thing to get electron and snowpack to work hand in hand: https://github.com/karolis-sh/electron-snowpack#readme
This might have been a bad idea, and I have a hunch that it is also what is causing the build to fail.
Maybe we should start from a plain Electron setup instead.
Our goal is a very easy installation of Easy Diffix.
It was mentioned that .NET, or at least the crucial files, can be bundled with the installation. I think we should really look into that.
Having said that, for a beta/alpha version it's not necessary. But maybe it's a nice piece of work for in between?
We could add a formatter check like the one we have for the website: https://github.com/diffix/website/blob/main/.github/workflows/checks.yml
Once we have tests it would be good to run those too.
Are there some typescript specific checks we could run already?
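As a starting point, the workflow could mirror the website's checks and add TypeScript-specific ones. Everything in this sketch (tool choices, action versions) is a suggestion:

```yaml
# Sketch of a checks workflow: formatter + TypeScript type check.
name: checks
on: [push, pull_request]
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
      - run: npm ci
      # Formatter check, like the website repo.
      - run: npx prettier --check .
      # TypeScript-specific: type-check without emitting output.
      - run: npx tsc --noEmit
```

Once tests exist, an `npm test` step slots in at the end.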