
pganonymize's People

Contributors

abhinavvaidya, bobslee, hkage, koptelovav, korsar182, nurikk, rheinwerk-mp, tilley


pganonymize's Issues

Script fails to run

Version: pganonymize-0.5.0

INFO: Found table definition "users"
Anonymizing |████████████████████████████████| 5656/5656
Traceback (most recent call last):
  File "/usr/local/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/__main__.py", line 10, in main
    main()
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 43, in anonymize_tables
    import_data(connection, column_dict, table_name, table_columns, primary_key, data)
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 154, in import_data
    copy_from(connection, data, temp_table, table_columns)
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 132, in copy_from
    cursor.copy_from(new_data, table, sep=COPY_DB_DELIMITER, null='\\N', columns=quoted_cols)
psycopg2.errors.UndefinedColumn: column ""id"" of relation "tmp_users" does not exist

@hkage
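One plausible explanation (an assumption, not confirmed in the report): newer psycopg2 releases quote the table and column names passed to cursor.copy_from() themselves, so names that arrive already quoted get quoted a second time. The doubling can be illustrated with plain PostgreSQL identifier quoting:

```python
# Hypothetical illustration: quoting an identifier that is already quoted
# embeds literal quote characters in the name, which PostgreSQL then reports
# as column ""id"" of relation "tmp_users" not existing.
def pg_quote(name):
    # standard PostgreSQL identifier quoting: double embedded quotes, wrap in "
    return '"%s"' % name.replace('"', '""')

print(pg_quote('id'))            # "id"
print(pg_quote(pg_quote('id')))  # """id""" -> a column literally named "id"
```

Passing bare column names and letting psycopg2 do the quoting avoids the collision.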

Better schema validation

If a schema has an invalid structure, the resulting errors are a bit confusing. E.g. if the "tables" definition is missing, the resulting error looks like this:

Traceback (most recent call last):
  File "/home/henning/.local/share/virtualenvs/postgresql-anonymizer-AUSqld0C/bin/pganonymize", line 11, in <module>
    load_entry_point('pganonymize', 'console_scripts', 'pganonymize')()
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/cli.py", line 32, in main
    truncate_tables(connection, schema.get('truncate', []))
AttributeError: 'list' object has no attribute 'get'

It would be nice to have a validation method that checks the basic structure of the YAML file:

  • The tables keyword
  • fields at the table level
  • A provider at the field level, with at least a name
  • ...

Instead of writing our own validation methods, a general Python-based YAML/schema validation library could be used.
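Until such a library is wired in, even a small hand-rolled check would turn the AttributeError above into a readable message. A minimal sketch (the function name is hypothetical):

```python
# Hypothetical sketch: fail early with a clear message instead of an
# AttributeError deep inside truncate_tables/anonymize_tables.
def validate_schema(definition):
    if not isinstance(definition, dict):
        raise ValueError("top level of the schema file must be a mapping")
    if "tables" not in definition:
        raise ValueError('schema file is missing the "tables" keyword')
    if not isinstance(definition["tables"], list):
        raise ValueError('"tables" must be a list of table definitions')
    return definition
```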

Script fails when a table column has upper case characters


Traceback (most recent call last):
  File "/usr/local/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/__main__.py", line 10, in main
    main()
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 39, in anonymize_tables
    import_data(connection, column_dict, table_name, table_columns, primary_key, data)
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 140, in import_data
    copy_from(connection, data, 'source', table_columns)
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 119, in copy_from
    cursor.copy_from(new_data, table, sep=COPY_DB_DELIMITER, null='\\N', columns=columns)
psycopg2.errors.UndefinedColumn: column "createdat" of relation "source" does not exist
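The error matches PostgreSQL's identifier case folding: unquoted identifiers are folded to lower case, so an unquoted createdAt is looked up as createdat unless the column name is quoted. A quick illustration (the helper names are mine):

```python
# Hypothetical illustration of PostgreSQL identifier handling.
def fold_unquoted(name):
    # what the server effectively does with an unquoted identifier
    return name.lower()

def quote_ident(name):
    # quoting preserves case; embedded quotes are doubled
    return '"%s"' % name.replace('"', '""')

print(fold_unquoted("createdAt"))  # createdat
print(quote_ident("createdAt"))    # "createdAt"
```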

parmap no longer supports Python 2.7

parmap release 1.5.3 dropped support for older versions of Python.

Relevant changelog entry:

parmap (1.5.3)

  • Drop support for unsupported python versions
  • Add support for python 3.10
  • Use tqdm.auto to have nice progress bars on jupyter notebooks (#26)
  • Add dummy _number_left for parallel async (#23)

Downgrading parmap to 1.5.2 fixes the issue on older Python versions.

ValueError when copying a table with id of type uuid4

Hey,

I get this error when trying to anonymize any table whose primary key id is of type uuid4:

INFO: Found table definition "users"
1216it [00:00, 38715.87it/s]                                                                   
Processing 1 batches for users:   0%|                           | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 210, in f
    return formatter(v)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 109, in uuid_formatter
    return 'i2Q', (16, (guid.int >> 64) & MAX_INT64, guid.int & MAX_INT64)
AttributeError: 'str' object has no attribute 'int'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/__main__.py", line 12, in main
    main(args)
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/cli.py", line 79, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose, dry_run=args.dry_run)
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 43, in anonymize_tables
    build_and_then_import_data(connection, table_name, primary_key, columns, excludes,
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 94, in build_and_then_import_data
    import_data(connection, temp_table, [primary_key] + column_names, filter(None, data))
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 173, in import_data
    mgr.copy([[escape_str_replace(val) for col, val in row.items()] for row in data])
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 294, in copy
    self.writestream(data, datastream)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 322, in writestream
    f, d = formatter(val)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 135, in <lambda>
    return lambda v: ('i', (-1,)) if v is None else formatter(v)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 213, in f
    errors.raise_from(ValueError, message, exc)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/errors/py3.py", line 9, in raise_from
    raise exccls(message) from exc
ValueError: error formatting value 16cfc1fb-16fc-4888-b6a7-3638698df7ae for column id
ERROR: 1

The value received by pgcopy is the string 16cfc1fb-16fc-4888-b6a7-3638698df7ae, while it expects an instance of uuid.UUID.
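A minimal sketch of the needed coercion (the helper name is hypothetical): pgcopy's uuid formatter reads guid.int, which plain strings don't have, so string primary keys would have to be wrapped in uuid.UUID before the copy.

```python
import uuid

def coerce_uuid(value):
    # pgcopy's uuid formatter accesses .int; wrap strings in uuid.UUID first
    return value if isinstance(value, uuid.UUID) else uuid.UUID(value)

coerce_uuid("16cfc1fb-16fc-4888-b6a7-3638698df7ae")  # uuid.UUID instance
```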

This is happening with this as schema.yml:

tables:
 - users:
    fields:
     - password:
        provider:
          name: mask
          sign: '?'

For context:

$ pip freeze
Faker==9.8.0
parmap==1.5.3
pganonymize==0.6.1
pgcopy==1.5.0
psycopg2==2.9.1
python-dateutil==2.8.2
pytz==2021.3
PyYAML==6.0
six==1.16.0
text-unidecode==1.3
tqdm==4.62.3

$ postgres --version
postgres (PostgreSQL) 11.8

$ psql -d my_db -c "\d users"

                                      Table "public.users"
          Column          |            Type             | Collation | Nullable |         Default         
--------------------------+-----------------------------+-----------+----------+-------------------------
 id                       | uuid                        |           | not null | 
 ...

Please let me know if some more information is needed, or if I missed some info from the documentation 😅

Test completion

Due to the urgency, most parts of the project are untested. A lot of unit tests should therefore be added, mostly for the utils.py module.

Using 'faker.unique.xx' as provider doesn't ensure uniqueness of the values...

... due to the use of raw parallelization.

To prove it, run the following:

import parmap
from pganonymizer.providers import FakeProvider


provider = FakeProvider(name='fake.unique.user_name')


def gen_values(qty=100):
    base_values = [f'v{n}' for n in range(qty)]
    parallel_values = parmap.map(provider.alter_value, base_values)
    serial_values = parmap.map(provider.alter_value, base_values, pm_parallel=False)
    return parallel_values, serial_values


pvals, svals = gen_values()

# verify uniqueness
print(len(set(pvals)), len(set(svals)))

On my machine, running it prints the following:

$ python test.py
16 100

If uniqueness were guaranteed, the output would have been 100 100.

Anonymizing error if there is a JSONB column in a table

I have a strange error:

pganonymizer.exceptions.BadDataFormat: invalid input syntax for type json
DETAIL:  Token "'" is invalid.
CONTEXT:  JSON data, line 1: {'...
COPY source, line 29, column ui_settings: "{'firstTime': True}"

YAML file:

tables:
  - accounts:
      fields:
        - name:
            provider:
              name: fake.name
        - email:
            provider:
              name: fake.email
        - phone:
            provider:
              name: fake.phone_number
        - title:
            provider:
              name: choice
              values:
                - "Mr"
                - "Mrs"
                - "Dr"
                - "Prof"
                - "Ms"

truncate:
  - django_session

ui_settings column values:
{"firstTime": true, "licenseBannerHasBeenShown": true}
{"firstTime": true}
{}

What am I doing wrong?
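The failing value "{'firstTime': True}" is Python's dict repr, not JSON, which suggests the JSONB value is being stringified with str() somewhere in the copy path (an assumption based on the error text). The difference:

```python
import json

row_value = {"firstTime": True}
print(str(row_value))         # {'firstTime': True}  (rejected by PostgreSQL)
print(json.dumps(row_value))  # {"firstTime": true}  (valid JSON)
```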

Subprocess "run" being used on Python2.7

When attempting to use the create_database_dump utility on a legacy Python 2.7 system, the subprocess command fails because run doesn't exist in the subprocess module bundled with Python 2.7.

https://docs.python.org/3.5/library/subprocess.html#older-high-level-api

I believe the equivalent is call instead of run.

Relevant traceback:

Traceback (most recent call last):
  File "/Users/brett/.virtualenvs/iris/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/__main__.py", line 12, in main
    main(args)
  File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/cli.py", line 89, in main
    create_database_dump(args.dump_file, pg_args)
  File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/utils.py", line 274, in create_database_dump
    subprocess.run(cmd, shell=True)
AttributeError: 'module' object has no attribute 'run'
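A sketch of the suggested swap (the wrapper name is hypothetical): subprocess.call is available on both Python 2.7 and 3.x, whereas subprocess.run was only added in 3.5, and call returns the child's exit status, which is all that is needed here.

```python
import subprocess

def run_dump(cmd):
    # subprocess.call exists on 2.7 and 3.x; subprocess.run is 3.5+
    return subprocess.call(cmd, shell=True)

print(run_dump("exit 0"))  # 0
```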

Commandline argument to list available providers

It would be nice to have a commandline argument that lists all available providers, e.g.:

$ pganonymize --list-providers

choice - Provider that returns a random value from a list of choices.
clear - Provider to set a field value to None.
fake - Provider to generate fake data.
...
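A sketch of how the flag could look with argparse (the PROVIDERS registry and the docstrings here are assumptions, not the project's actual code):

```python
import argparse

# Hypothetical registry: provider name -> first line of its docstring.
PROVIDERS = {
    "choice": "Provider that returns a random value from a list of choices.",
    "clear": "Provider to set a field value to None.",
    "fake": "Provider to generate fake data.",
}

def list_providers():
    # one "name - description" line per provider, sorted by name
    return ["%s - %s" % (name, PROVIDERS[name]) for name in sorted(PROVIDERS)]

def main(argv=None):
    parser = argparse.ArgumentParser(prog="pganonymize")
    parser.add_argument("--list-providers", action="store_true",
                        help="list all available providers and exit")
    args = parser.parse_args(argv)
    if args.list_providers:
        print("\n".join(list_providers()))

main(["--list-providers"])
```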

Commandline argument to create a database dump

After anonymizing a database it would be nice to be able to create a PostgreSQL dump file for further usage. The format could be hard coded for the first version (e.g. bzip2 compressed). Example usage:

$ pganonymize --schema=my_schema.yml \
    --host=localhost \
    --user=user \
    --password=password \
    --dbname=database \
    --dump-file=my_anonymized_database.bz2

During Exclude if the Result is "None" then TypeError is Raised

During the exclusion of rows, if a column value is None, pattern.match(row[column]) raises a TypeError.

https://github.com/rheinwerk-verlag/postgresql-anonymizer/blob/5f6d7b3e1a9f4ae22e843eb1c6d57314a1939936/pganonymizer/utils.py#L101

Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/pganonymize", line 11, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/__main__.py", line 10, in main
    main()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 38, in anonymize_tables
    data, table_columns = build_data(connection, table_name, columns, excludes, total_count, verbose)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 68, in build_data
    if not row_matches_excludes(row, excludes):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 101, in row_matches_excludes
    if pattern.match(row[column]):
TypeError: expected string or bytes-like object

Ref: myschema.yml

tables:
 - res_partner:
    fields:
     - name:
        provider:
          name: fake.name
     - email:
        provider:
          name: fake.email
    excludes: 
     - email:
        - "info.*@example.com"

Expected behaviour:

  • Ignore such records, since a None value cannot match the pattern anyway
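One possible guard, sketched from the exclude structure visible in the YAML above (the function body is an assumption, not the project's actual code): treat None as a non-match before calling the regex.

```python
import re

def row_matches_excludes(row, excludes):
    # excludes: list of {column: [regex, ...]} mappings, as in the YAML above
    for exclude in excludes or []:
        for column, patterns in exclude.items():
            value = row.get(column)
            if value is None:
                continue  # NULL can never match an exclude pattern
            if any(re.match(p, value) for p in patterns):
                return True
    return False

print(row_matches_excludes({"email": None},
                           [{"email": ["info.*@example.com"]}]))  # False
```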

compatibility with "GENERATED ALWAYS" columns

It would be great to have compatibility with https://www.postgresql.org/docs/current/ddl-generated-columns.html.
At the moment, when I name a generated column in the yml file, I get the following error:

- members:
    primary_key: uuid
    chunk_size: 5000
    fields:
     - name:
        provider:
          name: md5
     - firstname:
        provider:
          name: fake.first_name
     - lastname:
        provider:
          name: fake.last_name

psycopg2.errors.GeneratedAlways: column "name" can only be updated to DEFAULT
DETAIL:  Column "name" is a generated column.

This is, of course, because generated columns can only be updated to DEFAULT.
Maybe a generated provider would be a nice addition?
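One way the tool could become aware of such columns (a sketch; on PostgreSQL 12+, pg_attribute.attgenerated is 's' for stored generated columns and empty otherwise) is to query them up front and then skip them or set them to DEFAULT:

```python
# Hypothetical query: list the generated columns of a table so they can be
# skipped or set to DEFAULT instead of receiving a provider value.
GENERATED_COLUMNS_SQL = """
SELECT a.attname
FROM pg_attribute a
WHERE a.attrelid = %s::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
  AND a.attgenerated <> ''
"""
```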

Python 2.7 tests are failing

Because Python 2.7 has reached end of life, most of the images used for testing have dropped the Python 2.7 interpreter (and so has setting up Python within GitHub Actions; see actions/setup-python#672). This leads to failing tests and breaks the testing chain for the other Python versions. As our company still uses Python 2.7 in production environments, this project still needs to support Python 2.7.
