intenthq / anon
A UNIX Command To Anonymise Data
License: MIT License
Sometimes you want to make sure the hash action is irreversible and not vulnerable to rainbow table attacks. To support this, it would be useful to be able to optionally add a random salt to each hashed value (and perhaps this should be the default, for safety).
For example, given the following config and CSV, you'd expect to get the following output:
Config:
{
  "csv": {
    "delimiter": ","
  },
  "actions": [
    {
      // Salt is not given, so it is random and on by default.
      "name": "hash"
    },
    {
      "name": "hash",
      // Have no salt.
      "salt": false
    },
    {
      "name": "hash",
      // Have a salt, but one which stays the same for all values.
      "salt": "somesalt"
    }
  ]
}
Input:
foo,bar,lux
Output:
d8b685c1a4b889369299f275d583e34f94831bb6,62cdb7020ff920e5aa642c3d4066950dd1f01f4d,98307a2daa4aa31a9e0b2deeeb98dad737970927
Where the first column is effectively random, the second column is a deterministic hash, and the third is deterministic but with the salt added as a suffix. That is:
sha1(foo<some random noise>),sha1(bar),sha1(luxsomesalt)
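The three salt modes above could be sketched in Go roughly as follows. This is only an illustration under assumptions: the `hashValue` function and its convention (a nil salt means "random", a pointer to an empty string means "no salt") are hypothetical and not part of the tool, and the 16-byte salt length is an arbitrary choice.

```go
package main

import (
	"crypto/rand"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// hashValue returns the hex-encoded SHA-1 of value with the salt appended as
// a suffix. Hypothetical convention: a nil salt means "use fresh random bytes
// for this value"; a pointer to "" means "no salt".
func hashValue(value string, salt *string) (string, error) {
	var suffix []byte
	if salt == nil {
		// Random salt: the result differs on every call, so it cannot be
		// looked up in a precomputed table.
		suffix = make([]byte, 16)
		if _, err := rand.Read(suffix); err != nil {
			return "", err
		}
	} else {
		suffix = []byte(*salt)
	}
	sum := sha1.Sum(append([]byte(value), suffix...))
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	none, fixed := "", "somesalt"
	random, _ := hashValue("foo", nil)    // sha1(foo<some random noise>)
	plain, _ := hashValue("bar", &none)   // sha1(bar)
	salted, _ := hashValue("lux", &fixed) // sha1(luxsomesalt)
	fmt.Printf("%s,%s,%s\n", random, plain, salted)
}
```

Note that a random salt which is thrown away makes the column useless for joining against other datasets, which is presumably why the fixed-salt and no-salt modes stay available.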
It would be great to be able to entirely remove a column from the input.
Config could be something like the following:
{
  "actions": [
    {
      "name": "identity"
    },
    {
      "name": "remove"
    },
    {
      "name": "hash"
    }
  ]
}
Then, for an input like this one:
a,b,c
d,e,f
The output would be:
a,84a516841ba77a5b4648de2cd0dfcb30ea46dbb4
d,4a0a19218e082a343a1b17e5333409af9d98f0f5
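A minimal Go sketch of how a per-column `remove` action might behave, assuming one action per input column. The `applyActions` function is hypothetical and only mirrors the action names from the proposed config; the `hash` case here is a plain unsalted SHA-1.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// applyActions runs one action per column and drops columns marked "remove".
// Hypothetical helper: the action names follow the proposed config only.
func applyActions(row []string, actions []string) ([]string, error) {
	if len(row) != len(actions) {
		return nil, fmt.Errorf("row has %d columns, config has %d actions", len(row), len(actions))
	}
	out := make([]string, 0, len(row))
	for i, value := range row {
		switch actions[i] {
		case "identity":
			out = append(out, value)
		case "remove":
			// Emit nothing, so the column disappears from the output.
		case "hash":
			sum := sha1.Sum([]byte(value))
			out = append(out, hex.EncodeToString(sum[:]))
		default:
			return nil, fmt.Errorf("unknown action %q", actions[i])
		}
	}
	return out, nil
}

func main() {
	actions := []string{"identity", "remove", "hash"}
	for _, row := range [][]string{{"a", "b", "c"}, {"d", "e", "f"}} {
		out, err := applyActions(row, actions)
		if err != nil {
			panic(err)
		}
		fmt.Println(out)
	}
}
```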
Default behaviour: fail entire file
Configurable behaviour: skip row
Some comments on our HN post about this tool raise a concern that we don't address the elephant in the room: this tool is really not a good solution if you intend to make the resulting data public. A large body of research shows that de-anonymising data is entirely possible, and takes less and less effort, because there are almost always unique "fingerprints" left over in anonymised data.
We should add something to the README that:
Hello,
It would be great if this software supported k-anonymity, so that no row in the output is uniquely distinguishable. It should also be possible to output the maximum k for a given dataset.
Thanks!
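As a rough illustration of the "maximum k" part of this request: a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k rows, so the largest such k is the size of the smallest group. The sketch below assumes the caller already knows which columns are quasi-identifiers; `maxK` and its parameters are hypothetical names, not an existing feature.

```go
package main

import (
	"fmt"
	"strings"
)

// maxK returns the largest k for which rows are k-anonymous with respect to
// the quasi-identifier columns in quasi, i.e. the size of the smallest group
// of rows that share the same values in those columns. It returns 0 for an
// empty dataset.
func maxK(rows [][]string, quasi []int) int {
	counts := make(map[string]int)
	for _, row := range rows {
		parts := make([]string, len(quasi))
		for i, col := range quasi {
			parts[i] = row[col]
		}
		// Assumption: the unit separator does not occur in the data itself.
		counts[strings.Join(parts, "\x1f")]++
	}
	k := 0
	for _, n := range counts {
		if k == 0 || n < k {
			k = n
		}
	}
	return k
}

func main() {
	rows := [][]string{
		{"20-30 years", "N1"},
		{"20-30 years", "N1"},
		{"30-40 years", "N1"},
	}
	fmt.Println(maxK(rows, []int{0, 1})) // smallest group over both columns
	fmt.Println(maxK(rows, []int{1}))    // smallest group over the second column only
}
```

Enforcing a target k (by generalising or suppressing rows) is a harder problem and is not attempted here.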
Another very common way to reduce date precision is to group dates according to a period of time from an initial date.
For example, if we have a person's date of birth, we may want to output the range of years that the person's age falls into.
e.g. 1/1/1990 -> 1990 or 20-30 years
Possible config:
{
  "actions": [
    {
      "name": "timeElapsed",
      "dateConfig": {
        "format": "YYYYMMDD",
        // should we count the number of months or years
        "elapsedIn": "years",
        // since when should we count
        // accepts a date in the above format or `now` as a value
        "since": "19901212"
      },
      "rangeConfig": {
        "ranges": [
          {
            "gt": 20,
            "lte": 30,
            "output": "20-30 years"
          }
        ]
      }
    }
  ]
}
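One possible reading of this config, sketched in Go. Several things here are assumptions rather than anything the proposal settles: the elapsed time is counted in whole years from the input date up to a reference date (standing in for `since`), the Go layout `20060102` stands in for the config's `format` field, and `ageRange`, `yearsElapsed` and `bucket` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// ageRange mirrors one entry of the proposed "ranges" list: it matches when
// Gt < elapsed <= Lte.
type ageRange struct {
	Gt, Lte int
	Output  string
}

// yearsElapsed counts whole years between from and to (anniversary-based).
func yearsElapsed(from, to time.Time) int {
	years := to.Year() - from.Year()
	if to.Month() < from.Month() || (to.Month() == from.Month() && to.Day() < from.Day()) {
		years-- // this year's anniversary has not been reached yet
	}
	return years
}

// bucket parses value and reference with the given Go layout and returns the
// output label of the first range containing the elapsed years.
func bucket(value, reference, layout string, ranges []ageRange) (string, error) {
	from, err := time.Parse(layout, value)
	if err != nil {
		return "", err
	}
	to, err := time.Parse(layout, reference)
	if err != nil {
		return "", err
	}
	elapsed := yearsElapsed(from, to)
	for _, r := range ranges {
		if elapsed > r.Gt && elapsed <= r.Lte {
			return r.Output, nil
		}
	}
	return "", fmt.Errorf("no range matches %d years", elapsed)
}

func main() {
	ranges := []ageRange{{Gt: 20, Lte: 30, Output: "20-30 years"}}
	label, err := bucket("19900101", "20150615", "20060102", ranges)
	fmt.Println(label, err)
}
```

What to do when no range matches (error, empty value, or a catch-all label) is left open, as it is in the proposal.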
There are several options we could support here:
References:
Go's csv package provides some options that we should allow via config: https://golang.org/pkg/encoding/csv/#Reader
It would be good to have a preview too