rheinwerk-verlag / pganonymize
A commandline tool for anonymizing PostgreSQL databases
Home Page: http://pganonymize.readthedocs.io/
License: Other
Hi, @hkage! Can you please release a new version with the latest changes?
https://github.com/MagicStack/asyncpg
More an idea than a necessary feature, because asyncpg requires Python 3.5 or later.
Version: pganonymize-0.5.0
INFO: Found table definition "users"
Anonymizing |████████████████████████████████| 5656/5656
Traceback (most recent call last):
File "/usr/local/bin/pganonymize", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/pganonymizer/__main__.py", line 10, in main
main()
File "/usr/local/lib/python3.8/dist-packages/pganonymizer/cli.py", line 71, in main
anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 43, in anonymize_tables
import_data(connection, column_dict, table_name, table_columns, primary_key, data)
File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 154, in import_data
copy_from(connection, data, temp_table, table_columns)
File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 132, in copy_from
cursor.copy_from(new_data, table, sep=COPY_DB_DELIMITER, null='\\N', columns=quoted_cols)
psycopg2.errors.UndefinedColumn: column ""id"" of relation "tmp_users" does not exist
If a schema has an invalid structure, the resulting errors are a bit confusing. E.g. if the "tables" definition is missing, the resulting error looks like this:
Traceback (most recent call last):
File "/home/henning/.local/share/virtualenvs/postgresql-anonymizer-AUSqld0C/bin/pganonymize", line 11, in <module>
load_entry_point('pganonymize', 'console_scripts', 'pganonymize')()
File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/cli.py", line 32, in main
truncate_tables(connection, schema.get('truncate', []))
AttributeError: 'list' object has no attribute 'get'
It would be nice to have a validation method that checks for the basic structure of the YAML file:
tables keyword
fields at the table level
provider at the field level, with at least a name
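The three checks listed above could be sketched in plain Python as follows. This is only an illustration, not the project's implementation: the function name validate_schema and the wording of the messages are assumptions, and the schema layout is inferred from the YAML examples in these issues.

```python
def validate_schema(schema):
    """Return a list of human-readable problems found in a loaded schema."""
    # A bare list at the top level is what triggers the AttributeError above.
    if not isinstance(schema, dict):
        return ["top level must be a mapping with a 'tables' key"]
    tables = schema.get("tables")
    if not isinstance(tables, list):
        return ["'tables' must be a list of table definitions"]

    errors = []
    for table in tables:
        if not isinstance(table, dict) or len(table) != 1:
            errors.append("each table must be a single-key mapping: %r" % (table,))
            continue
        table_name, definition = next(iter(table.items()))
        fields = definition.get("fields") if isinstance(definition, dict) else None
        if not isinstance(fields, list):
            errors.append("table %r needs a 'fields' list" % (table_name,))
            continue
        for field in fields:
            if not isinstance(field, dict) or len(field) != 1:
                errors.append("each field must be a single-key mapping: %r" % (field,))
                continue
            field_name, options = next(iter(field.items()))
            provider = options.get("provider") if isinstance(options, dict) else None
            if not isinstance(provider, dict) or "name" not in provider:
                errors.append(
                    "field %s.%s needs a provider with a 'name'" % (table_name, field_name)
                )
    return errors


good = {"tables": [{"users": {"fields": [{"email": {"provider": {"name": "fake.email"}}}]}}]}
bad = [{"users": {"fields": []}}]  # 'tables' keyword missing, as in the report
print(validate_schema(good))
print(validate_schema(bad))
```

Collecting all problems in a list instead of raising on the first one lets the command line report every schema mistake in a single run.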
Instead of writing custom validation methods, it might be possible to use a general Python-based YAML validation library, e.g.:
Script fails when a table column has upper case characters
Traceback (most recent call last):
File "/usr/local/bin/pganonymize", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/site-packages/pganonymizer/__main__.py", line 10, in main
main()
File "/usr/local/lib/python3.9/site-packages/pganonymizer/cli.py", line 71, in main
anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 39, in anonymize_tables
import_data(connection, column_dict, table_name, table_columns, primary_key, data)
File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 140, in import_data
copy_from(connection, data, 'source', table_columns)
File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 119, in copy_from
cursor.copy_from(new_data, table, sep=COPY_DB_DELIMITER, null='\\N', columns=columns)
psycopg2.errors.UndefinedColumn: column "createdat" of relation "source" does not exist
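PostgreSQL folds unquoted identifiers to lower case, which would explain why the error mentions "createdat". The sketch below assumes the real column is named createdAt (a guess based on the error message) and shows how quoting a column name preserves its case. The helper quote_ident is hypothetical; psycopg2 also provides psycopg2.sql.Identifier for the same purpose.

```python
def quote_ident(name):
    """Quote a PostgreSQL identifier so that its case is preserved."""
    # Embedded double quotes are escaped by doubling them.
    return '"' + name.replace('"', '""') + '"'


columns = ["id", "createdAt"]  # assumed column names, for illustration only
column_list = ", ".join(quote_ident(column) for column in columns)
print(column_list)  # "id", "createdAt"
```

Quoting has to be applied exactly once: if the database driver already quotes the names it receives, quoting them beforehand produces names that do not exist either.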
parmap release 1.5.3 dropped support for older versions of Python.
Relevant changelog entry:
parmap (1.5.3)
Downgrading my parmap to 1.5.2 does fix the issue on older versions of Python.
Hey,
I get this error when trying to anonymize any table whose primary key id uses uuid4:
INFO: Found table definition "users"
1216it [00:00, 38715.87it/s]
Processing 1 batches for users: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 210, in f
return formatter(v)
File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 109, in uuid_formatter
return 'i2Q', (16, (guid.int >> 64) & MAX_INT64, guid.int & MAX_INT64)
AttributeError: 'str' object has no attribute 'int'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/bin/pganonymize", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/site-packages/pganonymizer/__main__.py", line 12, in main
main(args)
File "/usr/local/lib/python3.10/site-packages/pganonymizer/cli.py", line 79, in main
anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose, dry_run=args.dry_run)
File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 43, in anonymize_tables
build_and_then_import_data(connection, table_name, primary_key, columns, excludes,
File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 94, in build_and_then_import_data
import_data(connection, temp_table, [primary_key] + column_names, filter(None, data))
File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 173, in import_data
mgr.copy([[escape_str_replace(val) for col, val in row.items()] for row in data])
File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 294, in copy
self.writestream(data, datastream)
File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 322, in writestream
f, d = formatter(val)
File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 135, in <lambda>
return lambda v: ('i', (-1,)) if v is None else formatter(v)
File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 213, in f
errors.raise_from(ValueError, message, exc)
File "/usr/local/lib/python3.10/site-packages/pgcopy/errors/py3.py", line 9, in raise_from
raise exccls(message) from exc
ValueError: error formatting value 16cfc1fb-16fc-4888-b6a7-3638698df7ae for column id
ERROR: 1
The value received by pgcopy is the string 16cfc1fb-16fc-4888-b6a7-3638698df7ae, when it's expecting an instance of class uuid.UUID.
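A minimal sketch of the conversion that seems to be missing, using only the standard library. The helper name coerce_uuid is hypothetical, and where such a conversion would belong inside pganonymize is not something this report establishes.

```python
import uuid


def coerce_uuid(value):
    """Return value as uuid.UUID; None and existing UUIDs pass through."""
    if value is None or isinstance(value, uuid.UUID):
        return value
    return uuid.UUID(str(value))


raw = "16cfc1fb-16fc-4888-b6a7-3638698df7ae"
key = coerce_uuid(raw)
# A uuid.UUID has the .int attribute that the formatter in the traceback accesses.
print(repr(key))  # UUID('16cfc1fb-16fc-4888-b6a7-3638698df7ae')
```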
This is happening with this as schema.yml:
tables:
  - users:
      fields:
        - password:
            provider:
              name: mask
              sign: '?'
For context:
$ pip freeze
Faker==9.8.0
parmap==1.5.3
pganonymize==0.6.1
pgcopy==1.5.0
psycopg2==2.9.1
python-dateutil==2.8.2
pytz==2021.3
PyYAML==6.0
six==1.16.0
text-unidecode==1.3
tqdm==4.62.3
$ postgres --version
postgres (PostgreSQL) 11.8
$ psql -d my_db -c "\d users"
Table "public.users"
Column | Type | Collation | Nullable | Default
--------------------------+-----------------------------+-----------+----------+-------------------------
id | uuid | | not null |
...
Please let me know if some more information is needed, or if I missed some info from the documentation 😅
Due to the urgency, most parts of the project are untested. Therefore a lot of unit tests should be added, mostly for the utils.py module.
... due to the use of raw parallelization.
To prove it, run the following:
import parmap
from pganonymizer.providers import FakeProvider
provider = FakeProvider(name='fake.unique.user_name')
def gen_values(qty=100):
base_values = [f'v{n}' for n in range(qty)]
parallel_values = parmap.map(provider.alter_value, base_values)
serial_values = parmap.map(provider.alter_value, base_values, pm_parallel=False)
return parallel_values, serial_values
pvals, svals = gen_values()
# verify uniqueness
print(len(set(pvals)), len(set(svals)))
On my machine, when I run it, I get the following values:
$ python test.py
16 100
If it worked, it should have been 100 100
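One plausible explanation is that uniqueness bookkeeping lives inside each worker process, so workers cannot see each other's results. The toy model below illustrates that effect without Faker or parmap: ToyUniqueProvider is a made-up stand-in, and copying it per worker is an assumption about how process-based parallelism behaves, not a trace of the actual code. Its duplicate count is therefore not meant to match the 16 reported above.

```python
import copy


class ToyUniqueProvider(object):
    """Toy stand-in for a 'unique' provider with per-object state."""

    def __init__(self):
        self._counter = 0

    def alter_value(self, _value):
        # The next unused name according to *this* object's state only.
        name = "user%d" % self._counter
        self._counter += 1
        return name


def serial(values):
    provider = ToyUniqueProvider()
    return [provider.alter_value(value) for value in values]


def simulated_parallel(values, workers=4):
    # Assumption: every worker starts from its own copy of the provider.
    template = ToyUniqueProvider()
    chunk = len(values) // workers
    results = []
    for index in range(workers):
        provider = copy.deepcopy(template)
        part = values[index * chunk:(index + 1) * chunk]
        results.extend(provider.alter_value(value) for value in part)
    return results


values = ["v%d" % n for n in range(100)]
print(len(set(simulated_parallel(values))), len(set(serial(values))))  # 25 100
```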
I have a strange error:
pganonymizer.exceptions.BadDataFormat: invalid input syntax for type json
DETAIL: Token "'" is invalid.
CONTEXT: JSON data, line 1: {'...
COPY source, line 29, column ui_settings: "{'firstTime': True}"
YAML file:
tables:
  - accounts:
      fields:
        - name:
            provider:
              name: fake.name
        - email:
            provider:
              name: fake.email
        - phone:
            provider:
              name: fake.phone_number
        - title:
            provider:
              name: choice
              values:
                - "Mr"
                - "Mrs"
                - "Dr"
                - "Prof"
                - "Ms"
truncate:
  - django_session
ui_settings column values:
{"firstTime": true, "licenseBannerHasBeenShown": true}
{"firstTime": true}
{}
What am I doing wrong?
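The rejected value {'firstTime': True} looks like the text form of a Python dict rather than JSON. That would fit a column value being decoded into a dict when read and turned back into text with str() when written, although this is an assumption about the cause. The difference between the two text forms can be shown with the standard library alone:

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


ui_settings = {"firstTime": True}
as_repr = str(ui_settings)         # Python dict text form
as_json = json.dumps(ui_settings)  # JSON text form

print(as_repr, is_valid_json(as_repr))  # {'firstTime': True} False
print(as_json, is_valid_json(as_json))  # {"firstTime": true} True
```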
Hi,
Is it possible to use faker’s localized providers?
I need to use this one for example: https://faker.readthedocs.io/en/master/locales/fr_FR.html#faker.providers.ssn.fr_FR.Provider
Regards
When attempting to use the create_database_dump utility on a legacy Python 2.7 system, the subprocess command fails because run doesn't exist in the subprocess module bundled with Python 2.7.
https://docs.python.org/3.5/library/subprocess.html#older-high-level-api
I believe the equivalent is call instead of run.
Relevant traceback:
Traceback (most recent call last):
File "/Users/brett/.virtualenvs/iris/bin/pganonymize", line 8, in <module>
sys.exit(main())
File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/__main__.py", line 12, in main
main(args)
File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/cli.py", line 89, in main
create_database_dump(args.dump_file, pg_args)
File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/utils.py", line 274, in create_database_dump
subprocess.run(cmd, shell=True)
AttributeError: 'module' object has no attribute 'run'
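A small sketch of the suggested replacement. subprocess.call is available on Python 2.7 and 3.x alike and returns the exit code of the command. The wrapper name run_command is hypothetical, and the example runs the Python interpreter itself instead of a dump command so that it works anywhere.

```python
import subprocess
import sys


def run_command(cmd):
    """Run a command and return its exit code.

    Works on Python 2.7 and 3.x, unlike subprocess.run, which only
    exists from Python 3.5 on.
    """
    return subprocess.call(cmd)


exit_code = run_command([sys.executable, "-c", "import sys; sys.exit(3)"])
print(exit_code)  # 3
```

subprocess.call also accepts shell=True with a command string, so the call in the traceback could be changed with little else around it.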
It would be nice to have a commandline argument that lists all available providers, e.g.:
$ pganonymize --list-providers
choice - Provider that returns a random value from a list of choices.
clear - Provider to set a field value to None.
fake - Provider to generate fake data.
...
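One way such an option might be wired up is sketched below. The provider classes and the PROVIDERS registry are placeholders invented for this example (only their descriptions come from the listing above), so this shows the idea rather than the tool's actual code.

```python
import argparse


class ChoiceProvider(object):
    """Provider that returns a random value from a list of choices."""
    id = "choice"


class ClearProvider(object):
    """Provider to set a field value to None."""
    id = "clear"


# Hypothetical registry; the real project may organise providers differently.
PROVIDERS = [ClearProvider, ChoiceProvider]


def format_providers(providers):
    """Build one 'id - summary' line per provider from its docstring."""
    lines = []
    for provider in sorted(providers, key=lambda p: p.id):
        doc = (provider.__doc__ or "").strip()
        summary = doc.splitlines()[0] if doc else ""
        lines.append("%s - %s" % (provider.id, summary))
    return lines


def main(argv=None):
    parser = argparse.ArgumentParser(prog="pganonymize")
    parser.add_argument("--list-providers", action="store_true",
                        help="Show a list of all available providers")
    args = parser.parse_args(argv)
    if args.list_providers:
        print("\n".join(format_providers(PROVIDERS)))
    return 0


main(["--list-providers"])
```

Taking the summary from the first docstring line keeps the listing in step with the code without a separate description table.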
After anonymizing a database it would be nice to be able to create a PostgreSQL dump file for further usage. The format could be hard coded for the first version (e.g. bzip2 compressed). Example usage:
$ pganonymize --schema=my_schema.yml \
--host=localhost \
--user=user \
--password=password \
--dbname=database \
--dump-file=my_anonymized_database.bz2
While rows are being excluded, if a column contains a None value, matching the pattern against row[column] raises a TypeError:
Traceback (most recent call last):
File "/home/ubuntu/.local/bin/pganonymize", line 11, in <module>
sys.exit(main())
File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/__main__.py", line 10, in main
main()
File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/cli.py", line 71, in main
anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 38, in anonymize_tables
data, table_columns = build_data(connection, table_name, columns, excludes, total_count, verbose)
File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 68, in build_data
if not row_matches_excludes(row, excludes):
File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 101, in row_matches_excludes
if pattern.match(row[column]):
TypeError: expected string or bytes-like object
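The traceback suggests that pattern.match receives None when the column is NULL. A minimal sketch of a guard follows, assuming that a NULL value should simply never match an exclude pattern. The function name is borrowed from the traceback, but the body is an illustration and not the project's code.

```python
import re


def row_matches_excludes(row, excludes=None):
    """Return True if any exclude pattern matches the row's column value."""
    for definition in excludes or []:
        for column, patterns in definition.items():
            value = row.get(column)
            if value is None:
                # NULL columns cannot match a pattern; skip them.
                continue
            for pattern in patterns:
                if re.match(pattern, value):
                    return True
    return False


excludes = [{"email": ["info.*@example.com"]}]
print(row_matches_excludes({"email": None}, excludes))                     # False
print(row_matches_excludes({"email": "info-team@example.com"}, excludes))  # True
```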
Ref: myschema.yml
tables:
  - res_partner:
      fields:
        - name:
            provider:
              name: fake.name
        - email:
            provider:
              name: fake.email
      excludes:
        - email:
            - "info.*@example.com"
Expected Behaviour :
Thanks for the great tool!
It would be nice if this were compatible with dbtoyaml: https://pyrseas.readthedocs.io/en/latest/
Currently, making the YAML compatible requires some transformation that could be avoided.
It would be great to have compatibility with https://www.postgresql.org/docs/current/ddl-generated-columns.html.
At the moment, when I name a generated column in the yml file, I get the following error:
- members:
    primary_key: uuid
    chunk_size: 5000
    fields:
      - name:
          provider:
            name: md5
      - firstname:
          provider:
            name: fake.first_name
      - lastname:
          provider:
            name: fake.last_name
psycopg2.errors.GeneratedAlways: column "name" can only be updated to DEFAULT
DETAIL: Column "name" is a generated column.
This is, of course, because generated columns can only be updated in a certain way.
Maybe a generated provider would be a nice addition?
Hi,
Your project's name conflicts with PostgreSQL Anonymizer by @daamien from Dalibo. It seems that your project started 2 years after @daamien's one.
What do you think of this?
Anyway, this project is very interesting!
Regards,
Because Python 2.7 support is ending, most of the images used for testing have dropped the Python 2.7 interpreter (and so has the Python setup within GitHub Actions, see actions/setup-python#672). This leads to failing tests and breaks the test chain for the other Python versions. As our company still uses Python 2.7 in production environments, this project still needs to support Python 2.7.