
gonymizer's People

Contributors

aavision, dependabot[bot], feliperalmeida, gabrielpiassetta, junkert, marlontrapp, mayaradg, miouge1, nigelatdas, olivertso, pcolby, rafascar, rkuska, rouganstriker, u5surf

gonymizer's Issues

Missing vendor directory causing docker build to fail

The command docker build -t gonymizer . fails with the message:

build github.com/aws/aws-sdk-go/aws: cannot load github.com/aws/aws-sdk-go/aws: open /tmp/gonymizer/vendor/github.com/aws/aws-sdk-go/aws: no such file or directory
The command '/bin/sh -c GOOS=linux GOARCH=amd64 CGO_ENABLED=0 GOFLAGS=-mod=vendor go build -v -ldflags '-w -extldflags "-static"' -o bin/gonymizer ./cmd/...' returned a non-zero code: 1

I had to run go mod vendor on my machine before running docker build. Is this expected?

If not, adding RUN go mod vendor to the Dockerfile fixes the issue, but then I would have to COPY the source code in together with go.mod.
Alternatively, the use of go mod vendor could be removed from the project entirely.

What do you think?

Last but not least, thank you for writing this great package, it's really helpful.

gonymizer "process" ends with panic: runtime error: index out of range

Hi,

I am trying to use the docker image from smithoss/gonymizer and I am having partial success.
Is the number of rows possibly capped at 3,000,000?

Here is my docker command:
docker build --no-cache --rm gonymizer

The Dockerfile:

#FROM junkert/gonymizer
FROM smithoss/gonymizer

ENV PG_SRC_HOST=
ENV PG_SRC_USER=
ENV PG_SRC_PSWD=
ENV PG_SRC_PORT=
ENV PG_SRC_DBNAME=grossolini_as
ENV PG_SRC_SCHEMA=as

ENV FILE_PREFIX=$PG_SRC_DBNAME
ENV SKEL_FILE=$FILE_PREFIX.skeleton.json
ENV TMP_FILE=$FILE_PREFIX.tmp.sql
ENV DUMP_FILE=$FILE_PREFIX.dump.sql

ENV DBG_LVL=DEBUG


RUN pg_dump --version && pg_restore --version && gonymizer version

RUN apk update
RUN apk add jq

RUN touch $SKEL_FILE && rm $SKEL_FILE
#RUN gonymizer map --map-file="$FILE_PREFIX" -S -H "$PG_SRC_HOST" -p "$PG_SRC_PSWD" -U "$PG_SRC_USER" -P "$PG_SRC_PORT" -d "$PG_SRC_DBNAME" --schema "$PG_SRC_SCHEMA" -L "$DBG_LVL"
RUN echo '{"DBName":"grossolini_as","SchemaPrefix":"","Seed":0,"ColumnMaps":[{"Comment":"","TableSchema":"as","TableName":"as_streams_search","ColumnName":"v3_deprecated_login","DataType":"character varying","ParentSchema":"","ParentTable":"","ParentColumn":"","OrdinalPosition":16,"IsNullable":true,"Processors":[{"Name":"AlphaNumericScrambler","Max":0,"Min":0,"Variance":0,"Comment":""}]},{"Comment":"","TableSchema":"as","TableName":"as_streams_search","ColumnName":"v3_deprecated_password","DataType":"character varying","ParentSchema":"","ParentTable":"","ParentColumn":"","OrdinalPosition":17,"IsNullable":true,"Processors":[{"Name":"ScrubString","Max":0,"Min":0,"Variance":0,"Comment":""}]}]}' | jq > $SKEL_FILE

RUN touch $TMP_FILE && rm $TMP_FILE
RUN gonymizer dump --dump-file="$TMP_FILE" -S -H "$PG_SRC_HOST" -p "$PG_SRC_PSWD" -U "$PG_SRC_USER" -P "$PG_SRC_PORT" -d "$PG_SRC_DBNAME" --schema "$PG_SRC_SCHEMA" -L "$DBG_LVL"

RUN gonymizer process --map-file="$SKEL_FILE" --dump-file="$TMP_FILE" --processed-file="$DUMP_FILE" --generate-seed -L "$DBG_LVL"

The output with the smithoss/gonymizer image (starting after all the ENVs):

Step 25/32 : RUN pg_dump --version && pg_restore --version && gonymizer version
 ---> Running in acc22444b47e
pg_dump (PostgreSQL) 11.3
pg_restore (PostgreSQL) 11.3
gonymizer (v1.1.1, build 5, build date:2019-05-06 23:50:02 +0000 UTC)
Go (runtime:go1.12.6) (GOMAXPROCS:2) (NumCPUs:2)
Removing intermediate container acc22444b47e
 ---> 30c431b3bc6c
Step 26/32 : RUN apk update
 ---> Running in 524cd4b2c078
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
v3.10.5-83-g4b863b300a [http://dl-cdn.alpinelinux.org/alpine/v3.10/main]
v3.10.5-73-g9ff6848e18 [http://dl-cdn.alpinelinux.org/alpine/v3.10/community]
OK: 10365 distinct packages available
Removing intermediate container 524cd4b2c078
 ---> 2dfe825e40b6
Step 27/32 : RUN apk add jq
 ---> Running in 6f1fdc068f01
(1/2) Installing oniguruma (6.9.4-r0)
(2/2) Installing jq (1.6-r0)
Executing busybox-1.30.1-r2.trigger
OK: 37 MiB in 32 packages
Removing intermediate container 6f1fdc068f01
 ---> dc8d01962a14
Step 28/32 : RUN touch $SKEL_FILE && rm $SKEL_FILE
 ---> Running in a3525973e8d8
Removing intermediate container a3525973e8d8
 ---> bf1226931dee
Step 29/32 : RUN echo '{"DBName":"grossolini_as","SchemaPrefix":"","Seed":0,"ColumnMaps":[{"Comment":"","TableSchema":"as","TableName":"as_streams_search","ColumnName":"v3_deprecated_login","DataType":"character varying","ParentSchema":"","ParentTable":"","ParentColumn":"","OrdinalPosition":16,"IsNullable":true,"Processors":[{"Name":"AlphaNumericScrambler","Max":0,"Min":0,"Variance":0,"Comment":""}]},{"Comment":"","TableSchema":"as","TableName":"as_streams_search","ColumnName":"v3_deprecated_password","DataType":"character varying","ParentSchema":"","ParentTable":"","ParentColumn":"","OrdinalPosition":17,"IsNullable":true,"Processors":[{"Name":"ScrubString","Max":0,"Min":0,"Variance":0,"Comment":""}]}]}' | jq > $SKEL_FILE
 ---> Running in d4127754615f
Removing intermediate container d4127754615f
 ---> fd635a371ac3
Step 30/32 : RUN touch $TMP_FILE && rm $TMP_FILE
 ---> Running in 3441491edaf9
Removing intermediate container 3441491edaf9
 ---> dbdf302a46e6
Step 31/32 : RUN gonymizer dump --dump-file="$TMP_FILE" -S -H "$PG_SRC_HOST" -p "$PG_SRC_PSWD" -U "$PG_SRC_USER" -P "$PG_SRC_PORT" -d "$PG_SRC_DBNAME" --schema "$PG_SRC_SCHEMA" -L "$DBG_LVL"
 ---> Running in 2a7de0e1e1f8
time="2020-09-08 09:55:36.938" level=debug msg="๐Ÿ \x1b[1;32m configuration \x1b[0m ๐Ÿ‘‡"
Aliases:
map[string]string{}
Override:
map[string]interface {}{}
time="2020-09-08 09:55:36.938" level=debug msg="๐Ÿ \x1b[1;32m configuration \x1b[0m โ˜๏ธ"
time="2020-09-08 09:55:36.938" level=debug msg="os.Args: [gonymizer dump --dump-file=grossolini_as.tmp.sql -S **REMOVED** -L DEBUG]"
PFlags:
map[string]viper.FlagValue{"config":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d06e0)}, "database":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0c80)}, "disable-ssl":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0960)}, "dump-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d1180)}, "dump.database":viper.pflagValue{flag:(*pflag.Flag)(0xc000095c20)}, "dump.disable-ssl":viper.pflagValue{flag:(*pflag.Flag)(0xc000095900)}, "dump.dump-file":viper.pflagValue{flag:(*pflag.Flag)(0xc000095cc0)}, "dump.exclude-schemas":viper.pflagValue{flag:(*pflag.Flag)(0xc000095ae0)}, "dump.exclude-table":viper.pflagValue{flag:(*pflag.Flag)(0xc0000959a0)}, "dump.exclude-table-data":viper.pflagValue{flag:(*pflag.Flag)(0xc000095a40)}, "dump.host":viper.pflagValue{flag:(*pflag.Flag)(0xc000095b80)}, "dump.password":viper.pflagValue{flag:(*pflag.Flag)(0xc000095ea0)}, "dump.port":viper.pflagValue{flag:(*pflag.Flag)(0xc000095f40)}, "dump.row-count-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d05a0)}, "dump.schema":viper.pflagValue{flag:(*pflag.Flag)(0xc000095d60)}, "dump.username":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d00a0)}, "exclude-schemas":viper.pflagValue{flag:(*pflag.Flag)(0xc000095ae0)}, "exclude-table":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0aa0)}, "exclude-table-data":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0b40)}, "generate-seed":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d1040)}, "help":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d1400)}, "host":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0be0)}, "load.database":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0280)}, "load.disable-ssl":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0140)}, "load.host":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d01e0)}, "load.load-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0320)}, "load.password":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0460)}, "load.port":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0500)}, "load.skip-procedures":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d03c0)}, "load.username":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0640)}, "log-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0780)}, "log-format":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d08c0)}, "log-level":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0820)}, "map-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d10e0)}, "map.database":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0c80)}, "map.disable-ssl":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0960)}, "map.exclude-table":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0aa0)}, "map.exclude-table-data":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0b40)}, "map.host":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0be0)}, "map.map-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0a00)}, "map.password":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0e60)}, "map.port":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0f00)}, "map.schema":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0d20)}, "map.schema-prefix":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0dc0)}, "map.username":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0fa0)}, "password":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0e60)}, "port":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0f00)}, "post-process-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d12c0)}, "process.dump-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d1180)}, "process.generate-seed":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d1040)}, "process.map-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d10e0)}, 
"process.post-process-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d12c0)}, "process.processed-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d1220)}, "process.s3-file-path":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d1360)}, "processed-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d1220)}, "row-count-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0000)}, "s3-file-path":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d1360)}, "schema":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0d20)}, "schema-prefix":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0dc0)}, "username":viper.pflagValue{flag:(*pflag.Flag)(0xc0001d0fa0)}}
Env:
map[string]string{}
Key/Value Store:
map[string]interface {}{}
Config:
map[string]interface {}{}
Defaults:
map[string]interface {}{}
time="2020-09-08 09:55:36.938" level=debug msg="Starting gonymizer (v1.1.1, build 5, build date:2019-05-06 23:50:02 +0000 UTC)"
time="2020-09-08 09:55:36.938" level=debug msg="Go (runtime:go1.12.6) (GOMAXPROCS:2) (NumCPUs:2)\n"
time="2020-09-08 09:55:36.938" level=info msg="\x1b[1;33mEnabling log level: DEBUG\x1b[0m"
time="2020-09-08 09:55:41.963" level=info msg="๐Ÿšœ \x1b[1;32mCreating dump file\x1b[0m ๐Ÿšœ"
time="2020-09-08 09:55:41.963" level=info msg="Dumping the following schemas: [as]"
time="2020-09-08 09:55:41.963" level=debug msg="Running command: pg_dump  **REMOVED** --oids --no-owner --schema=as.*"
time="2020-09-08 09:56:34.741" level=info msg="๐Ÿฆ„ \x1b[1;32m-- SUCCESS --\x1b[0m ๐ŸŒˆ"
Removing intermediate container 2a7de0e1e1f8
 ---> 08b12d84887b
Step 32/32 : RUN gonymizer process --map-file="$SKEL_FILE" --dump-file="$TMP_FILE" --processed-file="$DUMP_FILE" --generate-seed -L "$DBG_LVL"
 ---> Running in a99a167a770a
Aliases:
map[string]string{}
Override:
map[string]interface {}{}
time="2020-09-08 09:56:40.034" level=debug msg="๐Ÿ \x1b[1;32m configuration \x1b[0m ๐Ÿ‘‡"
PFlags:
map[string]viper.FlagValue{"config":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac640)}, "database":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acbe0)}, "disable-ssl":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac8c0)}, "dump-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ad0e0)}, "dump.database":viper.pflagValue{flag:(*pflag.Flag)(0xc000079b80)}, "dump.disable-ssl":viper.pflagValue{flag:(*pflag.Flag)(0xc000079860)}, "dump.dump-file":viper.pflagValue{flag:(*pflag.Flag)(0xc000079c20)}, "dump.exclude-schemas":viper.pflagValue{flag:(*pflag.Flag)(0xc000079a40)}, "dump.exclude-table":viper.pflagValue{flag:(*pflag.Flag)(0xc000079900)}, "dump.exclude-table-data":viper.pflagValue{flag:(*pflag.Flag)(0xc0000799a0)}, "dump.host":viper.pflagValue{flag:(*pflag.Flag)(0xc000079ae0)}, "dump.password":viper.pflagValue{flag:(*pflag.Flag)(0xc000079e00)}, "dump.port":viper.pflagValue{flag:(*pflag.Flag)(0xc000079ea0)}, "dump.row-count-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac500)}, "dump.schema":viper.pflagValue{flag:(*pflag.Flag)(0xc000079cc0)}, "dump.username":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac000)}, "exclude-schemas":viper.pflagValue{flag:(*pflag.Flag)(0xc000079a40)}, "exclude-table":viper.pflagValue{flag:(*pflag.Flag)(0xc0001aca00)}, "exclude-table-data":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acaa0)}, "generate-seed":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acfa0)}, "help":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ad360)}, "host":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acb40)}, "load.database":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac1e0)}, "load.disable-ssl":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac0a0)}, "load.host":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac140)}, "load.load-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac280)}, "load.password":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac3c0)}, "load.port":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac460)}, "load.skip-procedures":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac320)}, "load.username":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac5a0)}, "log-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac6e0)}, "log-format":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac820)}, "log-level":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac780)}, "map-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ad040)}, "map.database":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acbe0)}, "map.disable-ssl":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac8c0)}, "map.exclude-table":viper.pflagValue{flag:(*pflag.Flag)(0xc0001aca00)}, "map.exclude-table-data":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acaa0)}, "map.host":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acb40)}, "map.map-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ac960)}, "map.password":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acdc0)}, "map.port":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ace60)}, "map.schema":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acc80)}, "map.schema-prefix":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acd20)}, "map.username":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acf00)}, "password":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acdc0)}, "port":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ace60)}, "post-process-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ad220)}, "process.dump-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ad0e0)}, "process.generate-seed":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acfa0)}, "process.map-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ad040)}, 
"process.post-process-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ad220)}, "process.processed-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ad180)}, "process.s3-file-path":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ad2c0)}, "processed-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ad180)}, "row-count-file":viper.pflagValue{flag:(*pflag.Flag)(0xc000079f40)}, "s3-file-path":viper.pflagValue{flag:(*pflag.Flag)(0xc0001ad2c0)}, "schema":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acc80)}, "schema-prefix":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acd20)}, "username":viper.pflagValue{flag:(*pflag.Flag)(0xc0001acf00)}}
Env:
map[string]string{}
Key/Value Store:
map[string]interface {}{}
Config:
map[string]interface {}{}
Defaults:
map[string]interface {}{}
time="2020-09-08 09:56:40.034" level=debug msg="๐Ÿ \x1b[1;32m configuration \x1b[0m โ˜๏ธ"
time="2020-09-08 09:56:40.034" level=debug msg="os.Args: [gonymizer process --map-file=grossolini_as.skeleton.json --dump-file=grossolini_as.tmp.sql --processed-file=grossolini_as.dump.sql --generate-seed -L DEBUG]"
time="2020-09-08 09:56:40.034" level=debug msg="Starting gonymizer (v1.1.1, build 5, build date:2019-05-06 23:50:02 +0000 UTC)"
time="2020-09-08 09:56:40.034" level=debug msg="Go (runtime:go1.12.6) (GOMAXPROCS:2) (NumCPUs:2)\n"
time="2020-09-08 09:56:40.034" level=info msg="\x1b[1;33mEnabling log level: DEBUG\x1b[0m"
time="2020-09-08 09:56:40.034" level=debug msg="s3-file-path: "
time="2020-09-08 09:56:40.034" level=debug msg="S3 URL: <nil>\tScheme: \tBucket: \tRegion: \tFile Path: "
time="2020-09-08 09:56:40.034" level=info msg="๐Ÿšœ \x1b[1;32mProcessing dump file\x1b[0m ๐Ÿšœ"
time="2020-09-08 09:56:40.034" level=info msg="Loading map file from: grossolini_as.skeleton.json"
time="2020-09-08 09:56:40.035" level=info msg="Processing dump file: grossolini_as.tmp.sql"
time="2020-09-08 09:56:40.035" level=debug msg="Using internal number generator for seed value: 2135716918243314402"

[ ...a bunch of Schema.Table logs... ]

time="2020-09-08 09:56:54.899" level=info msg="Processing line number: 2900000"
time="2020-09-08 09:56:55.249" level=info msg="Processing line number: 3000000"
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/smithoss/gonymizer.(*LineState).parseCopyLine(0xc000373928, 0xc000260aa0, 0x13)
        /usr/local/go/src/github.com/smithoss/gonymizer/generator.go:404 +0x5ac
github.com/smithoss/gonymizer.processLine(0xc00005d480, 0xc000373928, 0xc000260aa0, 0x13, 0xc000260aa0, 0x13, 0x0, 0x0, 0x0)
        /usr/local/go/src/github.com/smithoss/gonymizer/generator.go:301 +0x276
github.com/smithoss/gonymizer.ProcessDumpFile(0xc00005d480, 0x7ffd256fdd75, 0x15, 0x7ffd256fdd9c, 0x16, 0x0, 0x0, 0xaca301, 0x0, 0x0)
        /usr/local/go/src/github.com/smithoss/gonymizer/generator.go:200 +0x8de
main.process(0x7ffd256fdd75, 0x15, 0x7ffd256fdd4d, 0x1b, 0x7ffd256fdd9c, 0x16, 0x1, 0xc00007ea00, 0xc000157b30, 0x0)
        /usr/local/go/src/github.com/smithoss/gonymizer/command/process.go:136 +0x20d
main.cliCommandProcess(0x126ea60, 0xc000060ba0, 0x0, 0x6)
        /usr/local/go/src/github.com/smithoss/gonymizer/command/process.go:111 +0x830
github.com/spf13/cobra.(*Command).execute(0x126ea60, 0xc000060a80, 0x6, 0x6, 0x126ea60, 0xc000060a80)
        /go/pkg/mod/github.com/spf13/[email protected]/command.go:766 +0x2ae
github.com/spf13/cobra.(*Command).ExecuteC(0x126e5a0, 0x0, 0x0, 0xc00015a900)
        /go/pkg/mod/github.com/spf13/[email protected]/command.go:852 +0x2ec
github.com/spf13/cobra.(*Command).Execute(...)
        /go/pkg/mod/github.com/spf13/[email protected]/command.go:800
main.main()
        /usr/local/go/src/github.com/smithoss/gonymizer/command/main.go:116 +0x32
The command '/bin/sh -c gonymizer process --map-file="$SKEL_FILE" --dump-file="$TMP_FILE" --processed-file="$DUMP_FILE" --generate-seed -L "$DBG_LVL"' returned a non-zero code: 2

Here is the output using the junkert/gonymizer image:

Step 25/32 : RUN pg_dump --version && pg_restore --version && gonymizer version
 ---> Running in dee19173bf66
pg_dump (PostgreSQL) 12.3
pg_restore (PostgreSQL) 12.3
gonymizer (v1.2.0, build 10, build date:2019-07-31 17:23:35 +0000 UTC)
Go (runtime:go1.13.15) (GOMAXPROCS:2) (NumCPUs:2)
Removing intermediate container dee19173bf66
 ---> a344329f4c4b
Step 26/32 : RUN apk update
 ---> Running in 5a2dcd65cd28
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
v3.12.0-295-gaaa7f43dbe [http://dl-cdn.alpinelinux.org/alpine/v3.12/main]
v3.12.0-301-g6a4ffc91b5 [http://dl-cdn.alpinelinux.org/alpine/v3.12/community]
OK: 12743 distinct packages available
Removing intermediate container 5a2dcd65cd28
 ---> c50cfe680df3
Step 27/32 : RUN apk add jq
 ---> Running in 7ee73a81b6a9
(1/2) Installing oniguruma (6.9.5-r1)
(2/2) Installing jq (1.6-r1)
Executing busybox-1.31.1-r19.trigger
OK: 143 MiB in 34 packages
Removing intermediate container 7ee73a81b6a9
 ---> 4a6b2d3c0e5c
Step 28/32 : RUN touch $SKEL_FILE && rm $SKEL_FILE
 ---> Running in f7c9b3e50d61
Removing intermediate container f7c9b3e50d61
 ---> 8a9bb573a6be
Step 29/32 : RUN echo '{"DBName":"grossolini_as","SchemaPrefix":"","Seed":0,"ColumnMaps":[{"Comment":"","TableSchema":"as","TableName":"as_streams_search","ColumnName":"v3_deprecated_login","DataType":"character varying","ParentSchema":"","ParentTable":"","ParentColumn":"","OrdinalPosition":16,"IsNullable":true,"Processors":[{"Name":"AlphaNumericScrambler","Max":0,"Min":0,"Variance":0,"Comment":""}]},{"Comment":"","TableSchema":"as","TableName":"as_streams_search","ColumnName":"v3_deprecated_password","DataType":"character varying","ParentSchema":"","ParentTable":"","ParentColumn":"","OrdinalPosition":17,"IsNullable":true,"Processors":[{"Name":"ScrubString","Max":0,"Min":0,"Variance":0,"Comment":""}]}]}' | jq > $SKEL_FILE
 ---> Running in e9f771655822
Removing intermediate container e9f771655822
 ---> 976c25ed8ce8
Step 30/32 : RUN touch $TMP_FILE && rm $TMP_FILE
 ---> Running in f61f78f618a5
Removing intermediate container f61f78f618a5
 ---> 1b0f4064cfd2
Step 31/32 : RUN gonymizer dump --dump-file="$TMP_FILE" -S -H "$PG_SRC_HOST" -p "$PG_SRC_PSWD" -U "$PG_SRC_USER" -P "$PG_SRC_PORT" -d "$PG_SRC_DBNAME" --schema "$PG_SRC_SCHEMA" -L "$DBG_LVL"
 ---> Running in d84f19600f4a
time="2020-09-08 10:03:48.473" level=debug msg="os.Args: [gonymizer dump --dump-file=grossolini_as.tmp.sql -S **REMOVED** -L DEBUG]"
time="2020-09-08 10:03:48.473" level=debug msg="๐Ÿ  configuration  ๐Ÿ‘‡"
Aliases:
map[string]string{}
Override:
map[string]interface {}{}
PFlags:
map[string]viper.FlagValue{"config":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3220)}, "database":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a37c0)}, "disable-ssl":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a34a0)}, "dump-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3cc0)}, "dump.database":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2780)}, "dump.disable-ssl":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2460)}, "dump.dump-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2820)}, "dump.exclude-schema":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2640)}, "dump.exclude-table":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2500)}, "dump.exclude-table-data":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a25a0)}, "dump.host":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a26e0)}, "dump.password":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2a00)}, "dump.port":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2aa0)}, "dump.row-count-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a30e0)}, "dump.schema":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a28c0)}, "dump.username":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2be0)}, "exclude-schema":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2640)}, "exclude-table":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a35e0)}, "exclude-table-data":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3680)}, "generate-seed":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3b80)}, "help":viper.pflagValue{flag:(*pflag.Flag)(0xc0001f2140)}, "host":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3720)}, "inclusive":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3d60)}, "load.database":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2dc0)}, "load.disable-ssl":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2c80)}, "load.host":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2d20)}, "load.load-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2e60)}, "load.password":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2fa0)}, "load.port":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3040)}, "load.skip-procedures":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2f00)}, "load.username":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3180)}, "local-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001f20a0)}, "log-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a32c0)}, "log-format":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3400)}, "log-level":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3360)}, "map-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3c20)}, "map.database":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a37c0)}, "map.disable-ssl":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a34a0)}, "map.exclude-table":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a35e0)}, "map.exclude-table-data":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3680)}, "map.host":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3720)}, "map.map-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3540)}, "map.password":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a39a0)}, "map.port":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3a40)}, "map.schema":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3860)}, "map.schema-prefix":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3900)}, "map.username":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3ae0)}, "password":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a39a0)}, "port":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3a40)}, "post-process-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3f40)}, "pre-process-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3ea0)}, "process.dump-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3cc0)}, 
"process.generate-seed":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3b80)}, "process.inclusive":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3d60)}, "process.map-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3c20)}, "process.post-process-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3f40)}, "process.pre-process-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3ea0)}, "process.processed-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3e00)}, "processed-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3e00)}, "row-count-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a2b40)}, "s3-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001f2000)}, "schema":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3860)}, "schema-prefix":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3900)}, "upload.local-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001f20a0)}, "upload.s3-file":viper.pflagValue{flag:(*pflag.Flag)(0xc0001f2000)}, "username":viper.pflagValue{flag:(*pflag.Flag)(0xc0001a3ae0)}}
Env:
map[string]string{}
Key/Value Store:
map[string]interface {}{}
Config:
map[string]interface {}{}
Defaults:
map[string]interface {}{}
time="2020-09-08 10:03:48.473" level=debug msg="๐Ÿ  configuration  โ˜๏ธ"
time="2020-09-08 10:03:48.473" level=debug msg="Starting gonymizer (v1.2.0, build 10, build date: 2019-07-31 17:23:35 +0000 UTC)"
time="2020-09-08 10:03:48.473" level=debug msg="Go (runtime: go1.13.15) (GOMAXPROCS: 2) (NumCPUs: 2)
"
time="2020-09-08 10:03:48.473" level=info msg="Enabling log level: DEBUG"
time="2020-09-08 10:03:48.506" level=info msg="๐Ÿšœ Creating dump file ๐Ÿšœ"
time="2020-09-08 10:03:48.506" level=debug msg="Running command: pg_dump --oids --no-owner --schema=as.* -f grossolini_as.tmp.sql **REMOVED**"
time="2020-09-08 10:03:48.509" level=error msg="exit status 1"
time="2020-09-08 10:03:48.509" level=debug msg="name: pg_dump"
time="2020-09-08 10:03:48.509" level=debug msg="arg: [--oids --no-owner --schema=as.* -f grossolini_as.tmp.sql **REMOVED**]"
time="2020-09-08 10:03:48.509" level=debug msg="errBytes:
=====================
pg_dump: unrecognized option: oids
Try "pg_dump --help" for more information.

=====================
"
time="2020-09-08 10:03:48.509" level=debug msg="errBytes:
=====================
pg_dump: unrecognized option: oids
Try "pg_dump --help" for more information.

=====================
"
time="2020-09-08 10:03:48.509" level=error msg="STDOUT: "
time="2020-09-08 10:03:48.509" level=error msg="STDERR: pg_dump: unrecognized option: oids
Try "pg_dump --help" for more information.
"
time="2020-09-08 10:03:48.509" level=error msg="exit status 1"
time="2020-09-08 10:03:48.509" level=error msg="exit status 1"
time="2020-09-08 10:03:48.509" level=error msg="โŒ Gonymizer did not exit properly. See above for errors โŒ"
The command '/bin/sh -c gonymizer dump --dump-file="$TMP_FILE" -S -H "$PG_SRC_HOST" -p "$PG_SRC_PSWD" -U "$PG_SRC_USER" -P "$PG_SRC_PORT" -d "$PG_SRC_DBNAME" --schema "$PG_SRC_SCHEMA" -L "$DBG_LVL"' returned a non-zero code: 1

Help screen examples are incorrect

When reviewing the help screen for gonymizer --help, I noticed that some of the examples contain flags that are incorrect or deprecated. These should be fixed.

Mechanism for reducing dump size by limiting rows per table?

Thanks for this tool. It looks pretty great.

I'd like to both anonymize my data and decrease the overall size of the database. Is there a mechanism by which I could specify a maximum number of records for a particular table and drop any records beyond that maximum?

Discuss which string edit-difference algorithms should be supported

Currently we are using the Jaro-Winkler distance on most of our Faker output to make sure that we do not replace the data with anything that could be similar to the original. (see: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)
Code: https://github.com/smithoss/gonymizer/blob/master/processors.go#L220

I'm not even sure this is the right one. I did some research when I first added it to the project, but after reviewing other algorithms I found that others (such as those based on Levenshtein distance) may be a better fit.

Also, are these checks pointless given that the output string length differs from the input for every processor except the AlphaNumericScrambler?
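
For comparison, here is a minimal, self-contained sketch of what a Levenshtein-based similarity check could look like. The helper names and the threshold convention are made up for illustration; the project currently calls a Jaro-Winkler implementation at the line linked above.

    package main

    import "fmt"

    // levenshtein returns the edit distance between two strings using the
    // classic dynamic-programming algorithm (insert, delete, substitute; cost 1).
    func levenshtein(a, b string) int {
        ra, rb := []rune(a), []rune(b)
        prev := make([]int, len(rb)+1)
        curr := make([]int, len(rb)+1)
        for j := range prev {
            prev[j] = j
        }
        for i := 1; i <= len(ra); i++ {
            curr[0] = i
            for j := 1; j <= len(rb); j++ {
                cost := 1
                if ra[i-1] == rb[j-1] {
                    cost = 0
                }
                curr[j] = minInt(prev[j]+1, minInt(curr[j-1]+1, prev[j-1]+cost))
            }
            prev, curr = curr, prev
        }
        return prev[len(rb)]
    }

    func minInt(x, y int) int {
        if x < y {
            return x
        }
        return y
    }

    // tooSimilar reports whether the anonymized value is still close to the
    // original, using a normalized edit distance in place of Jaro-Winkler.
    func tooSimilar(original, anonymized string, threshold float64) bool {
        longest := len([]rune(original))
        if l := len([]rune(anonymized)); l > longest {
            longest = l
        }
        if longest == 0 {
            return true
        }
        similarity := 1.0 - float64(levenshtein(original, anonymized))/float64(longest)
        return similarity > threshold
    }

    func main() {
        fmt.Println(tooSimilar("alice@example.com", "alXce@exmple.com", 0.8)) // true: output still too close
        fmt.Println(tooSimilar("alice@example.com", "q7rp2@zzzzz.net", 0.8))  // false: sufficiently different
    }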

Create example configuration

We forgot to add example configurations for the dump, process, and load commands. We should also link from the example documentation directly into the code base, using testing/test_db.

ENUMs are not included in master dump file

We noticed today, when someone added a new ENUM type, that Gonymizer completely ignores the creation of these types. This is a huge bug and needs to be fixed ASAP.

To recreate:

  1. Create an ENUM type in the test database
  2. Run the dump command
  3. Search the PII/PHI dump file for 'CREATE TYPE'; if the count is > 0 then the issue is fixed

AlphaNumericScrambler breaks escaped encoding input

I tried to join the Slack channel as per https://github.com/smithoss/gonymizer/blob/master/CONTRIBUTING.md but failed because I have neither an existing account nor an invitation at smith-oss.slack.com :(

AlphaNumericScrambler uses scrambleString, which naively replaces ASCII letters and digits. When the input contains, e.g., newlines that pg_dump has written out as '\n', some of these will eventually be turned into '\x'. When psql reads '\x' it consumes the following two-digit hexadecimal number and injects the corresponding byte into the input stream, which then either turns into some unexpected Unicode character or, more likely, produces an invalid byte sequence.

Suggestion for a fix: retain backslash escape sequences as-is, and replace '\x..' with known-good values only.
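
A rough sketch of the suggested direction, assuming a scramble function shaped like the existing scrambleString (the function below is illustrative, not the project's code):

    package main

    import (
        "fmt"
        "math/rand"
        "strings"
    )

    // scramblePreservingEscapes replaces ASCII letters and digits with random
    // ones but copies pg_dump backslash escape sequences (\n, \t, \\, \x..)
    // through unchanged, so the COPY data stays valid when psql reads it back.
    // A fuller fix would also keep the digits of \x.. within the hex range.
    func scramblePreservingEscapes(in string) string {
        var out strings.Builder
        runes := []rune(in)
        for i := 0; i < len(runes); i++ {
            r := runes[i]
            if r == '\\' && i+1 < len(runes) {
                // Keep the backslash and the escaped character as-is.
                out.WriteRune(r)
                out.WriteRune(runes[i+1])
                i++
                continue
            }
            switch {
            case r >= 'a' && r <= 'z':
                out.WriteRune(rune('a' + rand.Intn(26)))
            case r >= 'A' && r <= 'Z':
                out.WriteRune(rune('A' + rand.Intn(26)))
            case r >= '0' && r <= '9':
                out.WriteRune(rune('0' + rand.Intn(10)))
            default:
                out.WriteRune(r)
            }
        }
        return out.String()
    }

    func main() {
        // The literal below is what an embedded newline looks like inside a pg_dump COPY line.
        fmt.Println(scramblePreservingEscapes(`first line\nsecond line`))
    }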

I can provide said fix. Would you like to provide insight into it?

It should be possible to reproduce this along the following lines:

  1. Create a database with utf-8 encoding and a 'text' column.
  2. Insert large text with newlines.
  3. Create a dump like pg_dump --if-exists --clean "$DATABASE_URL" > dump.sql
  4. Create a matching gonymizer config including AlphaNumericScrambler, eg.
        {
            "Comment": "email, name or other free text",
            "TableSchema": "public",
            "TableName": "redacted",
            "ColumnName": "redacted",
            "DataType": "text",
            "ParentSchema": "",
            "ParentTable": "",
            "ParentColumn": "",
            "OrdinalPosition": 12,
            "IsNullable": true,
            "Processors": [
                {
                    "Name": "AlphaNumericScrambler",
                    "Max": 0,
                    "Min": 0,
                    "Variance": 0,
                    "Comment": ""
                }
            ]
        },
  5. Run gonymizer on the dump like this (BTW it'd be super neat to be able to stream a dump through gonymizer instead)
    gonymizer process --map-file "$THIS_DIR/gonymizer-map.json" \
        --dump-file "dump.sql" \
        --processed-file "pseudonymized.sql"
  6. Try to restore the processed dump as a database: psql -qd "dbname" -f "pseudonymized.sql"; with luck (improving with the number of backslash escapes in the dump) you'll get errors like this actual one:
psql:<stdin>:5618: ERROR:  invalid byte sequence for encoding "UTF8": 0x97
CONTEXT:  COPY redacted, line 2723: "731	2020-08-17 11:40:14.697455	2020-09-08 04:16:11.137209	2020-08-27 06:00:00+00	2020-08-27 11:00:00..."

Troubleshoot why coveralls broke

I removed this for our 2.0 release to get the release through, and Coveralls was removed in the process. We should add it, or Codecov, back to the project.

Avoid processing data for the ignored tables

Hi,

Currently, all table data is passed through the processors, even if none of the fields of the table have a meaningful processor defined.
This causes a significant slowdown for certain databases and increases the chance of hitting a bug like #87.

Perhaps the data for those tables could be streamed to the output file without passing it through the processors?
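
A minimal sketch of the idea (hypothetical structures, not the real gonymizer types): when a COPY block starts, look up the table in the column maps once, and if nothing is configured, stream its rows straight to the output.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // copyTable extracts "schema.table" from a line such as
    // `COPY as.as_streams_search (col_a, col_b) FROM stdin;`.
    func copyTable(line string) string {
        fields := strings.Fields(line)
        if len(fields) < 2 {
            return ""
        }
        return fields[1]
    }

    func main() {
        // Stand-in for the map file: only tables listed here have processors.
        columnMaps := map[string][]string{"as.as_streams_search": {"v3_deprecated_login"}}

        in := bufio.NewScanner(os.Stdin)
        out := bufio.NewWriter(os.Stdout)
        defer out.Flush()

        skipping := false // true while inside a COPY block for an unmapped table
        for in.Scan() {
            line := in.Text()
            switch {
            case strings.HasPrefix(line, "COPY "):
                skipping = columnMaps[copyTable(line)] == nil
                fmt.Fprintln(out, line)
            case line == `\.`: // end of a COPY data block
                skipping = false
                fmt.Fprintln(out, line)
            case skipping:
                fmt.Fprintln(out, line) // unmapped table: stream through untouched
            default:
                // Mapped table: this is where processLine would anonymize the row.
                fmt.Fprintln(out, line)
            }
        }
    }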

Wrap pg_dump's ability to dump only a few tables

Hi,

Conversely to the ability to exclude tables from the dump, which gonymizer already wraps, pg_dump also has the ability to specify exactly which tables should be exported:

  -t, --table=TABLE            dump the named table(s) only

This would be nice during the prototyping phase with gonymizer: just focus on a few tables and avoid waiting too long on the network.

Cheers
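
A sketch of how a hypothetical --include-table option could be threaded into the pg_dump invocation (the flag name and helper below are assumptions, not existing gonymizer options, and the table names are illustrative):

    package main

    import (
        "fmt"
        "os/exec"
    )

    // buildDumpArgs mirrors the way gonymizer assembles pg_dump arguments, with a
    // hypothetical list of explicitly included tables appended as -t flags.
    func buildDumpArgs(schema, outFile string, includeTables []string) []string {
        args := []string{"--no-owner", "--schema=" + schema + ".*", "-f", outFile}
        for _, t := range includeTables {
            args = append(args, "-t", t) // pg_dump: dump only the named table(s)
        }
        return args
    }

    func main() {
        args := buildDumpArgs("as", "grossolini_as.tmp.sql",
            []string{"as.as_streams_search", "as.users"})
        cmd := exec.Command("pg_dump", args...)
        fmt.Println(cmd.String()) // inspect the command line without running it
    }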

Create an option for loading a pre-processed file to the start of the processed dump file

We recently hit an issue in production where the uuid-ossp plugin was not being loaded during the load process. This is because we are not adding:

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

to the top of our processed dump file.

To allow for this we should add an option to the process step for a 'pre-process' file that will be inserted at the top of the processed dump file. This will allow the user to specify extensions as well as other SQL statements that may be needed before importing the schema and table data.
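
A sketch of the mechanism being proposed: copy the pre-process SQL onto the front of the processed dump before any anonymized rows are written (file names below are illustrative):

    package main

    import (
        "io"
        "log"
        "os"
    )

    // prependFile copies the pre-process SQL (e.g. CREATE EXTENSION statements)
    // into the processed dump file before any other output is written.
    func prependFile(preProcessPath string, processed *os.File) error {
        pre, err := os.Open(preProcessPath)
        if err != nil {
            return err
        }
        defer pre.Close()
        _, err = io.Copy(processed, pre)
        return err
    }

    func main() {
        out, err := os.Create("processed.sql")
        if err != nil {
            log.Fatal(err)
        }
        defer out.Close()

        if err := prependFile("pre-process.sql", out); err != nil {
            log.Fatal(err)
        }
        // The anonymized dump lines would be written to `out` after this point.
    }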

Slack button is currently broken

When a new user tries to join via the Slack button in the README.md, they are never sent an e-mail invitation to the Slack channel. This should be looked into and fixed.

A dump with postgres version 10 generates oid error

I could not generate a dump of my postgres tables. Here are a few details:

  1. Version of Gonymizer: gonymizer (v1.2.0, build 10, build date:2019-08-01 03:23:35 +1000 AEST)

  2. Command that I ran: gonymizer -c anonymiser.json dump

  3. version of psql: psql (PostgreSQL) 12.4 (Ubuntu 12.4-0ubuntu0.20.04.1)

  4. version of postgres: 10

  5. Command executed with gonymizer:
    time="2020-10-21 18:46:38.179" level=debug msg="Running command: pg_dump --oids --no-owner --schema=public.* -f phi_dump.sql

  6. Errors seen: postgresql/12/bin/pg_dump: unrecognized option '--oids'

  7. Full Log: attached
    gonymizer_dump.log

This error has already been communicated to Levi Junkert.
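
For context, pg_dump removed the --oids option in PostgreSQL 12, which is why the newer client rejects it. One plausible direction for a fix, sketched here (the version detection and wiring are assumptions, not the project's actual code), is to only pass --oids to older pg_dump binaries:

    package main

    import (
        "fmt"
        "os/exec"
        "regexp"
        "strconv"
    )

    // pgDumpMajorVersion parses the major version out of `pg_dump --version`
    // output such as "pg_dump (PostgreSQL) 12.4 ...".
    func pgDumpMajorVersion() (int, error) {
        out, err := exec.Command("pg_dump", "--version").Output()
        if err != nil {
            return 0, err
        }
        m := regexp.MustCompile(`(\d+)\.`).FindSubmatch(out)
        if m == nil {
            return 0, fmt.Errorf("cannot parse pg_dump version from %q", out)
        }
        return strconv.Atoi(string(m[1]))
    }

    func main() {
        args := []string{"--no-owner", "--schema=public.*"}
        if v, err := pgDumpMajorVersion(); err == nil && v < 12 {
            // --oids was removed in pg_dump 12, so only pass it to older versions.
            args = append(args, "--oids")
        }
        fmt.Println(append([]string{"pg_dump"}, args...))
    }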

Fix linting issues

Currently we have a lot of linting issues that should be addressed:

$ golint
db_client.go:234:1: comment on exported function GetTableRowCountsInDB should be of the form "GetTableRowCountsInDB ..."
db_util.go:25:1: comment on exported method PGConfig.LoadFromCLI should be of the form "LoadFromCLI ..."
db_util.go:26:1: receiver name should be a reflection of its identity; don't use generic names such as "this" or "self"
db_util.go:41:1: receiver name should be a reflection of its identity; don't use generic names such as "this" or "self"
db_util.go:61:1: receiver name should be a reflection of its identity; don't use generic names such as "this" or "self"
db_util.go:79:1: receiver name should be a reflection of its identity; don't use generic names such as "this" or "self"
db_util.go:90:1: receiver name should be a reflection of its identity; don't use generic names such as "this" or "self"
db_util.go:106:1: receiver name should be a reflection of its identity; don't use generic names such as "this" or "self"
db_util.go:111:1: receiver name should be a reflection of its identity; don't use generic names such as "this" or "self"
generator.go:23:2: don't use ALL_CAPS in Go names; use CamelCase
generator.go:23:2: exported const STATE_CHANGE_TOKEN_BEGIN_COPY should have comment (or a comment on this block) or be unexported
generator.go:24:2: don't use ALL_CAPS in Go names; use CamelCase
generator.go:37:1: receiver name should be a reflection of its identity; don't use generic names such as "this" or "self"
generator.go:419:1: receiver name should be a reflection of its identity; don't use generic names such as "this" or "self"
generator_test.go:100:30: error strings should not be capitalized or end with punctuation or a newline
loader.go:41:10: should replace errors.New(fmt.Sprintf(...)) with fmt.Errorf(...)
loader.go:150:2: should replace lineNum += 1 with lineNum++
mapper.go:54:1: receiver name should be a reflection of its identity; don't use generic names such as "this" or "self"
mapper.go:69:1: exported method DBMapper.Validate should have comment or be unexported
mapper.go:245:26: error strings should not be capitalized or end with punctuation or a newline
processors.go:49:1: comment on exported var AlphaNumericMap should be of the form "AlphaNumericMap ..."
processors.go:53:1: comment on exported var UUIDMap should be of the form "UUIDMap ..."
processors.go:184:14: should replace errors.New(fmt.Sprintf(...)) with fmt.Errorf(...)
processors.go:185:9: if block ends with a return statement, so drop this else and outdent its block
processors.go:189:15: should replace errors.New(fmt.Sprintf(...)) with fmt.Errorf(...)
processors.go:204:2: var inputId should be inputID
processors.go:232:3: should replace counter += 1 with counter++
processors.go:241:3: var finalId should be finalID
processors_test.go:48:2: don't use underscores in Go names; var output_a should be outputA
processors_test.go:50:2: don't use underscores in Go names; var output_b should be outputB
processors_test.go:52:2: don't use underscores in Go names; var output_c should be outputC
processors_test.go:234:2: don't use underscores in Go names; var temp_uuid should be tempUUID
s3.go:23:2: struct field Url should be URL
s3.go:27:1: receiver name should be a reflection of its identity; don't use generic names such as "this" or "self"
version.go:6:2: don't use underscores in Go names; const internal_BUILD_TIMESTAMP should be internalBUILDTIMESTAMP
version.go:7:2: don't use underscores in Go names; const internal_BUILD_NUMBER should be internalBUILDNUMBER
version.go:8:2: don't use underscores in Go names; const internal_VERSION_STRING should be internalVERSIONSTRING
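
Most of these are mechanical. For example, two of the recurring complaints above (the generic "this" receiver and errors.New(fmt.Sprintf(...))) and their idiomatic fixes look roughly like this (illustrative snippet, not the actual repository code):

    package main

    import (
        "errors"
        "fmt"
    )

    // PGConfig is a stand-in for the project's config struct.
    type PGConfig struct {
        Host   string
        DBName string
    }

    // URIBefore shows two of the patterns golint flags: a generic receiver
    // name ("this") and errors.New(fmt.Sprintf(...)).
    func (this *PGConfig) URIBefore() (string, error) {
        if this.Host == "" {
            return "", errors.New(fmt.Sprintf("no host set for database %s", this.DBName))
        }
        return this.Host + "/" + this.DBName, nil
    }

    // URIAfter is the idiomatic version: fmt.Errorf and a receiver named after the type.
    func (cfg *PGConfig) URIAfter() (string, error) {
        if cfg.Host == "" {
            return "", fmt.Errorf("no host set for database %s", cfg.DBName)
        }
        return cfg.Host + "/" + cfg.DBName, nil
    }

    func main() {
        cfg := &PGConfig{DBName: "grossolini_as"}
        _, errBefore := cfg.URIBefore()
        _, errAfter := cfg.URIAfter()
        fmt.Println(errBefore, errAfter)
    }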

Add dump server metadata to top of processed files

@wavemoran brought up a good point today during a quick sprint demo of the project.

What would happen if we swapped the loader and processor server information in the configuration between the processing stage and the loading stage (i.e. the dump, processor, and loader all have identical server, database, and credential settings)? In that case we would be taking a dump file from the same server we would be loading into (same database, schemas, allthethings). This would cause issues if the dump and load servers were both production servers 😱... We do, however, make sure to NEVER run a DROP DATABASE during any execution path in the load command. Instead we do an atomic rename of the primary database we are loading into by adding a timestamp to its name. Since we rename the primary database to the timestamped name, we can then rename the anonymized database to the primary name.

We can fix this issue by adding metadata such as: hostname, port, and database to the pre-processor. This way we can read this data during the load command and verify that the server we are loading into is not the same server we dumped from.
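
A sketch of what writing and checking such a header could look like (the comment format, field set, and function names are assumptions for illustration):

    package main

    import (
        "bufio"
        "fmt"
        "io"
        "strings"
    )

    // DumpMeta identifies the server a processed dump file came from.
    type DumpMeta struct {
        Host, Port, Database string
    }

    // writeMeta emits the metadata as SQL comments at the top of the processed file.
    func writeMeta(w io.Writer, m DumpMeta) {
        fmt.Fprintf(w, "-- gonymizer-source-host: %s\n", m.Host)
        fmt.Fprintf(w, "-- gonymizer-source-port: %s\n", m.Port)
        fmt.Fprintf(w, "-- gonymizer-source-database: %s\n", m.Database)
    }

    // sameServer reads the header back during `load` and reports whether the
    // load target matches the dump source, so the command can refuse to proceed.
    func sameServer(r io.Reader, target DumpMeta) bool {
        src := DumpMeta{}
        sc := bufio.NewScanner(r)
        for sc.Scan() && strings.HasPrefix(sc.Text(), "-- gonymizer-source-") {
            parts := strings.SplitN(strings.TrimPrefix(sc.Text(), "-- gonymizer-source-"), ": ", 2)
            if len(parts) != 2 {
                continue
            }
            switch parts[0] {
            case "host":
                src.Host = parts[1]
            case "port":
                src.Port = parts[1]
            case "database":
                src.Database = parts[1]
            }
        }
        return src == target
    }

    func main() {
        var buf strings.Builder
        meta := DumpMeta{Host: "db.internal", Port: "5432", Database: "grossolini_as"}
        writeMeta(&buf, meta)
        fmt.Println(sameServer(strings.NewReader(buf.String()), meta)) // true: refuse to load
    }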

Pair of requests to the wishlist

  1. I know that Postgres is case sensitive, but would it be possible for the tool to tell PostgreSQL to maintain the upper/lower case used in the gonymizer configuration file when creating the database into which the data will be loaded?

  2. Could you add a flag/mechanism to the gonymizer config file so that the pg_dump command does not specify the schema flag? I have had cases in which extensions were not being dumped into the backup file, and this is because of how the schema flag works.

Add more tests

Our code coverage is lacking in some areas. For example, processors.go is > 90%, but generator.go is < 60% 😭 🐼

Start by adding tests in generator.go.

Change to testify.Require instead of testify.Assert where needed

You don't need to change it, but I assume you're checking for a nil error return. The testify libraries have a specific NoError function.

Also (this can be changed later if you want to), I generally recommend using testify/require rather than testify/assert. It has exactly the same set of functions; you can s/assert/require/. The difference is that the assert functions continue the test even if the assertion fails. That can often result in a nil pointer issue that causes a panic and a big stack trace, making the test output harder to read. The require methods hard-fail, and the rest of that specific test won't run (other tests will still get run).

Originally posted by @endophage in #5
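
A tiny illustration of the difference (a hypothetical test, not taken from the repository):

    package example

    import (
        "errors"
        "testing"

        "github.com/stretchr/testify/assert"
        "github.com/stretchr/testify/require"
    )

    // config and loadConfig are hypothetical stand-ins for code under test.
    type config struct{ DBName string }

    func loadConfig(path string) (*config, error) {
        return nil, errors.New("open " + path + ": no such file or directory")
    }

    // With assert, the failed NoError check is recorded but the test keeps going,
    // so the next line dereferences a nil pointer and panics with a long stack trace.
    func TestWithAssert(t *testing.T) {
        cfg, err := loadConfig("missing.json")
        assert.NoError(t, err)
        assert.Equal(t, "grossolini_as", cfg.DBName) // panics: cfg is nil
    }

    // With require, the test hard-fails at the first check and the nil
    // dereference is never reached; other tests still run.
    func TestWithRequire(t *testing.T) {
        cfg, err := loadConfig("missing.json")
        require.NoError(t, err)
        require.Equal(t, "grossolini_as", cfg.DBName) // not reached
    }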

Add checks to DBMapper.Validate()

Currently we only check the map against the DBMapper.DBName value, making sure its length is > 0. There are many more checks we could do here to validate that the DBMap is in the correct format. One example is to check the structure of the DBMap and verify that all variables are in the correct form.
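
A sketch of the kind of additional checks that could be added (the struct fields here follow the map-file JSON shown earlier on this page; the specific rules are only suggestions):

    package main

    import "fmt"

    // ColumnMap mirrors the fields visible in the map-file JSON above.
    type ColumnMap struct {
        TableSchema, TableName, ColumnName, DataType string
        Processors                                   []struct{ Name string }
    }

    // DBMapper mirrors the top-level map file structure.
    type DBMapper struct {
        DBName     string
        ColumnMaps []ColumnMap
    }

    // Validate extends the current length-of-DBName check with structural checks
    // on every column map entry.
    func (m *DBMapper) Validate() error {
        if len(m.DBName) == 0 {
            return fmt.Errorf("map file is missing DBName")
        }
        for i, cm := range m.ColumnMaps {
            switch {
            case cm.TableSchema == "" || cm.TableName == "" || cm.ColumnName == "":
                return fmt.Errorf("ColumnMaps[%d]: schema, table and column must all be set", i)
            case cm.DataType == "":
                return fmt.Errorf("ColumnMaps[%d] (%s.%s.%s): missing DataType",
                    i, cm.TableSchema, cm.TableName, cm.ColumnName)
            case len(cm.Processors) == 0:
                return fmt.Errorf("ColumnMaps[%d] (%s.%s.%s): at least one processor is required",
                    i, cm.TableSchema, cm.TableName, cm.ColumnName)
            }
        }
        return nil
    }

    func main() {
        m := DBMapper{DBName: "grossolini_as", ColumnMaps: []ColumnMap{{TableName: "as_streams_search"}}}
        fmt.Println(m.Validate())
    }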

Support processor parameters

Some functions in the fake package accept arguments, e.g. DigitsN. Does Gonymizer support passing parameters to processors? It seems like it doesn't, but I might be missing something. If not, do you have any plans for this?
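
For reference, the map file format already carries Max, Min, and Variance fields on each processor entry (see the JSON earlier on this page), so one possible shape for a parameterized processor is sketched below. The digitsN stand-in and the wiring are assumptions, not existing gonymizer behavior.

    package main

    import (
        "fmt"
        "math/rand"
        "strings"
    )

    // ProcessorDefinition mirrors the processor entries in the map file.
    type ProcessorDefinition struct {
        Name     string
        Max, Min int
    }

    // digitsN is a stand-in for a faker function that takes a length argument.
    func digitsN(n int) string {
        var b strings.Builder
        for i := 0; i < n; i++ {
            b.WriteByte(byte('0' + rand.Intn(10)))
        }
        return b.String()
    }

    // ProcessorRandomDigits shows how the existing Max field could be passed
    // through to a parameterized generator.
    func ProcessorRandomDigits(def ProcessorDefinition, _ string) (string, error) {
        n := def.Max
        if n <= 0 {
            n = 10 // arbitrary default length
        }
        return digitsN(n), nil
    }

    func main() {
        out, _ := ProcessorRandomDigits(ProcessorDefinition{Name: "RandomDigits", Max: 6}, "123456")
        fmt.Println(out) // e.g. "493817"
    }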

Concurrent processor

Hey,

First off, thanks for sharing this great tool! It's been really helpful to me for a recent assignment I was tasked to do.

I'm not sure if you're interested in unsolicited feature PRs, but I have a working concurrent version of gonymizer here. It's currently functional but still very rough around the edges and not ready to be reviewed/merged; I just wanted to see whether you would be interested in getting it merged back upstream in the future.

To give you a bit of background about these changes: I was looking into anonymizing customer database dumps, and your tool came out ahead of the competition in terms of fit for our use case. I added a couple more processors for our specific requirements and it performed well. However, when testing on our own test data (1.2GB) we found that it took ~3.5 hours to finish processing. We expect our customers' data to be at least 19GB, which would take too long to process on the customer's machine. I then spent a weekend optimizing gonymizer to try to get that time down, and managed to get it to only 2 minutes for the same 1.2GB dump running with 2 workers. However, these changes are fairly substantial and I may not have time to clean up the code/write tests/etc. for them to be mergeable.

The key changes (if you want to adapt some of these changes piecemeal) are:

  1. Concurrent processing with goroutines (this is a large change and requires making the data structures in the processors concurrency-safe); see the sketch after this list. This dropped ~3.5 hours to 4 minutes.
  2. Changing the list of maps to a map of maps. This is a much smaller change and dropped 4 minutes to 2 minutes, though I'm not sure how much it would speed things up on its own; perhaps something you can try if you're interested.
  3. Regex-processing the COPY line instead of slicing the line multiple times. I tried this much earlier than everything else, on a smaller dataset, and it consistently dropped the time a little.
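
A stripped-down sketch of the worker-pool shape described in point 1 (this is just the general pattern, not the code in the fork, and it ignores preserving the order of output lines, which the real change has to handle):

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
        "sync"
    )

    // anonymizeLine is a stand-in for running a COPY data line through the
    // configured processors.
    func anonymizeLine(line string) string {
        return strings.ToUpper(line)
    }

    func main() {
        const maxWorkers = 2

        lines := make(chan string, 1024)
        results := make(chan string, 1024)

        var wg sync.WaitGroup
        for w := 0; w < maxWorkers; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for line := range lines {
                    results <- anonymizeLine(line)
                }
            }()
        }

        // Close the results channel once every worker has finished.
        go func() {
            wg.Wait()
            close(results)
        }()

        // Feed the dump file line-by-line to the workers.
        go func() {
            sc := bufio.NewScanner(os.Stdin)
            for sc.Scan() {
                lines <- sc.Text()
            }
            close(lines)
        }()

        out := bufio.NewWriter(os.Stdout)
        defer out.Flush()
        for line := range results {
            fmt.Fprintln(out, line)
        }
    }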

If you want to try it out on your own data, feel free to check out my fork. You need to add max-workers: N, where N is the number of workers (I recommend at least 2), to the config yaml file, and also a "MaxLength" field to each of your ColumnMaps (unrelated to any of the changes above, but unfortunately part of the commit because I didn't have enough time to split the changes into separate commits).

The remaining work:

  • Update tests
  • Write more tests
  • Update README with changes
  • Split off changes into separate branch from my fork

Cheers,
Emin

Add S3 support to all input/output files

We need to support AWS Lambda, which will allow Gonymizer to run as a serverless application. To do this we need to be able to store all files in S3 (not just the processed file from the dump command and the row_counts file from the dump and load commands). This should be pretty easy to implement since we already have the S3 functions in s3.go.
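
Since s3.go already depends on aws-sdk-go, extending S3 support to the remaining inputs and outputs is mostly plumbing; an upload helper might look roughly like this (the bucket, key, and file names are illustrative):

    package main

    import (
        "log"
        "os"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3/s3manager"
    )

    // uploadToS3 pushes a local file (map file, dump file, processed file, or
    // row-count file) to the given bucket/key.
    func uploadToS3(localPath, bucket, key string) error {
        f, err := os.Open(localPath)
        if err != nil {
            return err
        }
        defer f.Close()

        sess, err := session.NewSession()
        if err != nil {
            return err
        }
        _, err = s3manager.NewUploader(sess).Upload(&s3manager.UploadInput{
            Bucket: aws.String(bucket),
            Key:    aws.String(key),
            Body:   f,
        })
        return err
    }

    func main() {
        if err := uploadToS3("grossolini_as.dump.sql", "example-bucket", "dumps/grossolini_as.dump.sql"); err != nil {
            log.Fatal(err)
        }
    }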

Create Processor for TSRange

Currently we do not handle the TSRange data type when dealing with dates. We need to make sure we handle date ranges properly when scrambling range dates and timestamps.
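
A rough sketch of what a tsrange-style processor could do (the literal parsing is simplified and the same-offset shift is just one possible strategy):

    package main

    import (
        "fmt"
        "math/rand"
        "strings"
        "time"
    )

    const tsLayout = "2006-01-02 15:04:05"

    // scrambleTSRange shifts both bounds of a tsrange literal such as
    // ["2020-08-27 06:00:00","2020-08-27 11:00:00") by the same random offset,
    // preserving the duration and the inclusivity brackets.
    func scrambleTSRange(in string) (string, error) {
        if len(in) < 2 {
            return "", fmt.Errorf("not a range literal: %q", in)
        }
        openBracket, closeBracket := in[0], in[len(in)-1]
        bounds := strings.Split(strings.Trim(in[1:len(in)-1], `"`), `","`)
        if len(bounds) != 2 {
            return "", fmt.Errorf("unexpected tsrange format: %q", in)
        }
        lower, err := time.Parse(tsLayout, bounds[0])
        if err != nil {
            return "", err
        }
        upper, err := time.Parse(tsLayout, bounds[1])
        if err != nil {
            return "", err
        }
        shift := time.Duration(rand.Intn(365*24)) * time.Hour // shift up to ~1 year
        return fmt.Sprintf(`%c"%s","%s"%c`, openBracket,
            lower.Add(shift).Format(tsLayout),
            upper.Add(shift).Format(tsLayout), closeBracket), nil
    }

    func main() {
        fmt.Println(scrambleTSRange(`["2020-08-27 06:00:00","2020-08-27 11:00:00")`))
    }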

Create Docker Hub Images

We should create basic Docker images for each process, along with some example shell scripts for the dump, process, and load commands. This should help others get up and running without having to build the project.

Include only specified columns in the anonymized dump

Hello gonymizer team,

We are currently evaluating gonymizer for our use cases.

We want to achieve the following (simplified) use case:

  • Table T with columns A,B
  • Specified map file with only columns A,B
  • whenever a new column C with sensitive data is added to table T, C will not be included in the anonymized dump

Our experiments showed that even though column C was not specified in the map file, it was still included in the dump.

Is there a way to get the desired behavior?

Create Developer Documentation

We need documentation on how to set up the development environment for people who are new to the project. Include:

  • Overview of test harness
  • CircleCI and Docker containers for:
    • building Gonymizer binary
    • test verification
    • Docker image builds
  • Docker containers for testing
  • Docker PostgreSQL developer env

Reduce memory usage during the dump process

We recently had a production issue where we were running out of memory (4G!) while running the pg_dump command. This was quite confusing until I found this article:

https://stackoverflow.com/questions/50345177/how-to-limit-pg-dumps-memory-usage

One of the comments at the bottom points out that not using '-f' causes memory usage to increase during the dump, because we take the stdout from exec and pipe it to an open file handle, as seen here:

https://github.com/smithoss/gonymizer/blob/master/generator.go#L103-L108
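
A minimal sketch of the difference (simplified; not the exact code at the linked lines): let pg_dump write the file itself with -f instead of the Go process handling pg_dump's stdout.

    package main

    import (
        "log"
        "os/exec"
    )

    func main() {
        // Before (roughly): pg_dump's stdout is captured by the Go process and
        // then written to the dump file, so the dump contents pass through memory:
        //
        //   out, _ := exec.Command("pg_dump", "--no-owner", "mydb").Output()
        //   os.WriteFile("dump.sql", out, 0o644)

        // After: let pg_dump write the file itself with -f; the Go process never
        // holds the dump contents.
        cmd := exec.Command("pg_dump", "--no-owner", "-f", "dump.sql", "mydb")
        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }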
