Giter VIP home page Giter VIP logo

go-anonymize-mysqldump's Introduction

anonymize-mysqldump

CircleCI

Allows you to pipe data from mysqldump or an SQL file and anonymize it:

mysqldump -u yada -pbadpass -h db | anonymize-mysqldump --config config.json > anonymized.sql
usage: anonymize-mysqldump [-h|--help] -c|--config "<value>"

                           Reads SQL from STDIN and replaces content for
                           anonymity based on the provided config.

Arguments:

  -h  --help    Print help information
  -c  --config  Path to config.json

Installation

You can download the binary for your system from the Releases page. Once downloaded and gunzip'd, move it to a location in your path such as /usr/local/bin and make it executable. For instance, to download the MacOS binary for 64 bit platforms (this is most common):

LATEST="0.3.0"
curl -OL https://github.com/humanmade/go-anonymize-mysqldump/releases/download/$LATEST/go-anonymize-mysqldump_darwin_amd64.gz
gunzip go-anonymize-mysqldump_darwin_amd64.gz
mv go-anonymize-mysqldump_darwin_amd64 /usr/local/bin/anonymize-mysqldump
chmod +x /usr/local/bin/anonymize-mysqldump

Usage

This tool is designed to read a file stream over STDIN and produce an output over STDOUT. A config file is required and can be provided via the -c or --config flag. An example config for anonymizing a WordPress database is provided at config.example.json:

curl -LO https://raw.githubusercontent.com/humanmade/go-anonymize-mysqldump/master/config.example.json

Whenever the tool experiences an error, it will output a log to STDERR. If you wish to not see that output while the command is running, redirect it to some other file (or /dev/null if you don't care):

mysqldump -u yada -pbadpass -h db | anonymize-mysqldump --config config.json 2> path/to/errors.log > anonymized.sql

Caveats

Important things to be aware of!

  • Currently this only modifies INSERT statements. Should you wish to modify other fields, feel free to submit a PR.
  • Verify the output file has been modified. This is a friendly reminder this tool is still in its early days and you should verify the output sql file before distributing it to ensure the desired modifications have been applied.

Config File

An example config for anonymizing a WordPress database is provided at config.example.json.

The config is composed of many objects in the patterns array:

  • patterns: an array of objects defining what modifications should be made.
    • tableName: the name of the table the data will be stored in (used to parse INSERT statements to d etermine if the query should be modified.)
    • fields: an array of objects defining modifications to individual values' fields
      • field: a string representing the name of the field. Not currently used, but still required to work and useful for debugging.
      • position: the 1-based index of what number column this field represents. For instance, assuming a table with 3 columns foo, bar, and baz, and you wished to modify the bar column, this value would be 2.
      • type: a string representing the type of data stored in this field. Read more about field types here.
      • constraints: an array of objects defining comparison rules used to determine if a value should be modified or not. Currently these are limited to a simple string equality comparison.
        • field: a string representing the name of the field.
        • position: the 1-based index of what number column this field represents. For instance, assuming a table with 3 columns foo, bar, and baz, and you wished to modify the bar column, this value would be 2.
        • value: string value to match against.

Constraints

Supposing you have a WordPress database and you need to modify certain meta, be it user meta, post meta, or comment meta. You can use constraints to update data only whenever a certain condition is matched. For instance, let's say you have a user meta key last_ip_address. If you wanted to change that value, you can use the following config in the fields array:

{
  "field": "meta_value",
  "position": 4,
  "type": "ipv4",
  "constraints": [
    {
      "field": "meta_key",
      "position": 3,
      "value": "last_ip_address"
    }
  ]
}

Field Types

Each column stores a certain type of data, be it a name, username, email, etc. The type property in the config is used to define the type of data stored, and ultimately the type of random data to be inserted into the field. https://github.com/dmgk/faker is used for generating the fake data. These are the types currently supported:

  • username
  • password
  • email
  • url
  • name
  • firstName
  • lastName
  • paragraph
  • ipv4

If you need another type, please feel free to add support and file a PR!

Credit

Many thanks to Automattic/go-search-replace for serving as the starting point for this tool! Also many thanks to xwb1989/sqlparser for the SQL parsing library. I wouldn't have been able to do this without it!

go-anonymize-mysqldump's People

Contributors

nathanielks avatar svandragt avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

go-anonymize-mysqldump's Issues

Crash if asked to replace a value which can be NULL

Given a database like

DROP TABLE IF EXISTS `test`;
CREATE TABLE `test` (
`email` varchar(255),
);
INSERT INTO `test` VALUES ('foo'), (NULL), ('hodger');

if you use a config like

{
  "patterns": [
    {
      "tableName": "test",
      "fields": [
        {
          "field": "email",
          "position": 1,
          "type": "email",
          "constraints": null
        }
      ]
    }
  ]
}

the tool will crash with a panic:

panic: interface conversion: sqlparser.Expr is *sqlparser.NullVal, not *sqlparser.SQLVal

goroutine 35 [running]:
main.modifyValues(0xc0003f6060, 0x3, 0x4, 0xc00008e2f8, 0x4, 0xc0001e6000, 0x1, 0x4, 0x0, 0x0, ...)
	/Users/aang/Code/golang/src/github.com/humanmade/go-anonymize-mysqldump/anonymize-mysqldump.go:299 +0x572
main.applyConfigToInserts(0xc00041a0b0, 0xc0001e4000, 0x1, 0x4, 0x0, 0x0, 0x0)
	/Users/aang/Code/golang/src/github.com/humanmade/go-anonymize-mysqldump/anonymize-mysqldump.go:277 +0x1f9
main.applyConfigToParsedLine(0x126f020, 0xc00041a0b0, 0xc0001e4000, 0x1, 0x4, 0x0, 0x0, 0x0, 0x0)
	/Users/aang/Code/golang/src/github.com/humanmade/go-anonymize-mysqldump/anonymize-mysqldump.go:250 +0x68
main.processLine(0xc0001cc1c0, 0x36, 0xc0001e4000, 0x1, 0x4, 0x0, 0x0)
	/Users/aang/Code/golang/src/github.com/humanmade/go-anonymize-mysqldump/anonymize-mysqldump.go:218 +0x2c1
main.processInput.func1(0xc00008e320, 0xc0001e4000, 0x1, 0x4, 0xc0003f20c0, 0xc0001cc1c0, 0x36)
	/Users/aang/Code/golang/src/github.com/humanmade/go-anonymize-mysqldump/anonymize-mysqldump.go:194 +0xab
created by main.processInput
	/Users/aang/Code/golang/src/github.com/humanmade/go-anonymize-mysqldump/anonymize-mysqldump.go:192 +0x41c

If NULL does not appear, it works fine.

Add CI

Need to run tests automatically.

Get this working with multi-line statements

I discovered today that this won't be able to parse multi-line insert statements like the following SQL:

INSERT INTO `wp_cavalcade_jobs` VALUES
(1,1,"wp_version_check","a:0:{}","2017-08-24 08:15:12","2017-09-11 20:15:12",43200,"failed"),
(2,1,"wp_update_plugins","a:0:{}","2017-08-24 08:15:12","2017-09-11 20:15:12",43200,"failed"),
(3,1,"wp_update_themes","a:0:{}","2017-08-24 08:15:12","2017-09-11 20:15:12",43200,"failed"),
(4,1,"wp_scheduled_delete","a:0:{}","2017-10-15 10:11:15","2017-10-15 10:11:15",86400,"failed");

Because we're parsing line by line, it reads each value as its own SQL statement which breaks the parser. I need to either merge them all into a single line or find some other solution.

Generate random data instead of fixed strings

While I've been building the prototype, I've been replacing values using functions with fixed strings. I want to use a faker-type library to actually generate random data, but unit testing randomly generated data is tough so I've been deferring solving that problem.

I just realized I could use the existing functions as mocks for testing purposes, now I just need to write the integration.

Ability to skip rows

We're currently using production database in dev.

We plan to improve it by anonymizing the user data.

However we have some test/dev/admin accounts with a variety of roles and associated data which we want to preserve.

Would be nice being able to skip any transformation to a row if a condition matches. Example config we'd use to achieve this

      "tableName": "users",
      "ignore": [
        {
          "field": "mail",
          "position": 6,
          "value": "[email protected]"
        },
        {
          "field": "mail",
          "position": 6,
          "value": "[email protected]"
        }
      ],
      "fields": [

As a work around:

We take 2 mysqldumps, the first one with all data going through anonymization.

The 2nd dump is appended to the original with "REPLACE INTO" queries with the users we plan to keep as is.

mysqldump --no-create-info --where "email like '%admin.com'" --replace application user_data > own_users.sql

[BUG] Generated emails & usernames aren't unique

If you have a UNIQUE index set for those fields, anonymize will break the dump as there's no option for truly unique emails/usernames.

My work around has been to add new transformation functions:

// add these imports
import (
    "strconv"
    "time"
)

func generateUniqueEmail(value *sqlparser.SQLVal) *sqlparser.SQLVal {
	return sqlparser.NewStrVal([]byte(strconv.Itoa(int(time.Now().UnixNano())) + "@gmail.com"))
}
func generateUniqueUsername(value *sqlparser.SQLVal) *sqlparser.SQLVal {
	return sqlparser.NewStrVal([]byte(faker.Internet().UserName() + strconv.Itoa(int(time.Now().UnixNano()))))
}

Installation example doesn't work ("not in gzip format")

I just tried to follow the installation instructions for Mac. I ran

curl -OL https://github.com/humanmade/go-anonymize-mysqldump/releases/download/latest/go-anonymize-mysqldump_darwin_amd64.gz

and then tried

gunzip go-anonymize-mysqldump_darwin_amd64.gz

However, the latter returns

gunzip: go-anonymize-mysqldump_darwin_amd64.gz: not in gzip format

Make invalid field type error more clear

I'm getting a:

panic: interface conversion: sqlparser.Expr is *sqlparser.NullVal, not *sqlparser.SQLVal

goroutine 33 [running]:
main.modifyValues(0xc0004f4000, 0x4fd, 0x555, 0xc000014420, 0x9, 0xc0000f8700, 0x4, 0x4, 0x0, 0x0, ...)
	/Users/aang/Code/golang/src/github.com/humanmade/go-anonymize-mysqldump/anonymize-mysqldump.go:299 +0x572
main.applyConfigToInserts(0xc0000a2dc0, 0xc0000c08c0, 0x2, 0x4, 0x0, 0x0, 0x0)
	/Users/aang/Code/golang/src/github.com/humanmade/go-anonymize-mysqldump/anonymize-mysqldump.go:277 +0x1f9
main.applyConfigToParsedLine(0x66be60, 0xc0000a2dc0, 0xc0000c08c0, 0x2, 0x4, 0x0, 0x0, 0x0, 0x0)
	/Users/aang/Code/golang/src/github.com/humanmade/go-anonymize-mysqldump/anonymize-mysqldump.go:250 +0x68
main.processLine(0xc000560000, 0xff0d5, 0xc0000c08c0, 0x2, 0x4, 0x0, 0x0)
	/Users/aang/Code/golang/src/github.com/humanmade/go-anonymize-mysqldump/anonymize-mysqldump.go:218 +0x2c1
main.processInput.func1(0xc000014550, 0xc0000c08c0, 0x2, 0x4, 0xc0003dc360, 0xc000560000, 0xff0d5)
	/Users/aang/Code/golang/src/github.com/humanmade/go-anonymize-mysqldump/anonymize-mysqldump.go:194 +0xab
created by main.processInput
	/Users/aang/Code/golang/src/github.com/humanmade/go-anonymize-mysqldump/anonymize-mysqldump.go:192 +0x41c
mysqldump: Got errno 32 on write

I believe it is because i set some field types that are not yet supported.
If that's the case, i think the error should be more clear, so it's easier to know that it's my fault and not a real error : O

Enable ability to specify custom configs

I want to be able to add additional configs, as in when applications have additional fields to anonymize or aren't even WordPress. Ideally this would be in a JSON/YAML/TOML format.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.