
pganonymize's People

Contributors

abhinavvaidya, bobslee, hkage, koptelovav, korsar182, nurikk, rheinwerk-mp, tilley


pganonymize's Issues

Script fails to run

Version: pganonymize-0.5.0

INFO: Found table definition "users"
Anonymizing |████████████████████████████████| 5656/5656
Traceback (most recent call last):
  File "/usr/local/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/__main__.py", line 10, in main
    main()
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 43, in anonymize_tables
    import_data(connection, column_dict, table_name, table_columns, primary_key, data)
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 154, in import_data
    copy_from(connection, data, temp_table, table_columns)
  File "/usr/local/lib/python3.8/dist-packages/pganonymizer/utils.py", line 132, in copy_from
    cursor.copy_from(new_data, table, sep=COPY_DB_DELIMITER, null='\\N', columns=quoted_cols)
psycopg2.errors.UndefinedColumn: column ""id"" of relation "tmp_users" does not exist

@hkage
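One plausible explanation (an assumption, not confirmed in the report): newer psycopg2 releases quote the table and column names passed to cursor.copy_from() themselves, so names that arrive already quoted get quoted a second time. The doubling can be illustrated with plain PostgreSQL identifier quoting:

```python
# Hypothetical illustration: quoting an identifier that is already quoted
# embeds literal quote characters in the name, which PostgreSQL then reports
# as column ""id"" of relation "tmp_users" not existing.
def pg_quote(name):
    # standard PostgreSQL identifier quoting: double embedded quotes, wrap in "
    return '"%s"' % name.replace('"', '""')

print(pg_quote('id'))            # "id"
print(pg_quote(pg_quote('id')))  # """id""" -> a column literally named "id"
```

Passing bare column names and letting psycopg2 do the quoting avoids the collision.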

Better schema validation

If a schema has an invalid structure, the resulting errors are a bit confusing. E.g. if the "tables" definition is missing, the resulting error looks like this:

Traceback (most recent call last):
  File "/home/henning/.local/share/virtualenvs/postgresql-anonymizer-AUSqld0C/bin/pganonymize", line 11, in <module>
    load_entry_point('pganonymize', 'console_scripts', 'pganonymize')()
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/cli.py", line 32, in main
    truncate_tables(connection, schema.get('truncate', []))
AttributeError: 'list' object has no attribute 'get'

It would be nice to have a validation method that checks the basic structure of the YAML file:

  • The tables keyword
  • fields at the table level
  • A provider at the field level, with at least a name
  • ...

Instead of writing our own validation methods, a general Python-based YAML/schema validation library could be used.
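Until such a library is wired in, even a small hand-rolled check would turn the AttributeError above into a readable message. A minimal sketch (the function name is hypothetical):

```python
# Hypothetical sketch: fail early with a clear message instead of an
# AttributeError deep inside truncate_tables/anonymize_tables.
def validate_schema(definition):
    if not isinstance(definition, dict):
        raise ValueError("top level of the schema file must be a mapping")
    if "tables" not in definition:
        raise ValueError('schema file is missing the "tables" keyword')
    if not isinstance(definition["tables"], list):
        raise ValueError('"tables" must be a list of table definitions')
    return definition
```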

Script fails when a table column has upper case characters


Traceback (most recent call last):
  File "/usr/local/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/__main__.py", line 10, in main
    main()
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 39, in anonymize_tables
    import_data(connection, column_dict, table_name, table_columns, primary_key, data)
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 140, in import_data
    copy_from(connection, data, 'source', table_columns)
  File "/usr/local/lib/python3.9/site-packages/pganonymizer/utils.py", line 119, in copy_from
    cursor.copy_from(new_data, table, sep=COPY_DB_DELIMITER, null='\\N', columns=columns)
psycopg2.errors.UndefinedColumn: column "createdat" of relation "source" does not exist
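The error matches PostgreSQL's identifier case folding: unquoted identifiers are folded to lower case, so an unquoted createdAt is looked up as createdat unless the column name is quoted. A quick illustration (the helper names are mine):

```python
# Hypothetical illustration of PostgreSQL identifier handling.
def fold_unquoted(name):
    # what the server effectively does with an unquoted identifier
    return name.lower()

def quote_ident(name):
    # quoting preserves case; embedded quotes are doubled
    return '"%s"' % name.replace('"', '""')

print(fold_unquoted("createdAt"))  # createdat
print(quote_ident("createdAt"))    # "createdAt"
```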

parmap no longer supports Python 2.7

parmap release 1.5.3 dropped support for older versions of Python.

Relevant changelog entry:

parmap (1.5.3)

  • Drop support for unsupported python versions
  • Add support for python 3.10
  • Use tqdm.auto to have nice progress bars on jupyter notebooks (#26)
  • Add dummy _number_left for parallel async (#23)

Downgrading parmap to 1.5.2 fixes the issue on older Python versions.

ValueError when copying a table with id of type uuid4

Hey,

I get this error when trying to anonymize any table whose primary key id is of type uuid4:

INFO: Found table definition "users"
1216it [00:00, 38715.87it/s]                                                                   
Processing 1 batches for users:   0%|                           | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 210, in f
    return formatter(v)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 109, in uuid_formatter
    return 'i2Q', (16, (guid.int >> 64) & MAX_INT64, guid.int & MAX_INT64)
AttributeError: 'str' object has no attribute 'int'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/__main__.py", line 12, in main
    main(args)
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/cli.py", line 79, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose, dry_run=args.dry_run)
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 43, in anonymize_tables
    build_and_then_import_data(connection, table_name, primary_key, columns, excludes,
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 94, in build_and_then_import_data
    import_data(connection, temp_table, [primary_key] + column_names, filter(None, data))
  File "/usr/local/lib/python3.10/site-packages/pganonymizer/utils.py", line 173, in import_data
    mgr.copy([[escape_str_replace(val) for col, val in row.items()] for row in data])
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 294, in copy
    self.writestream(data, datastream)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 322, in writestream
    f, d = formatter(val)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 135, in <lambda>
    return lambda v: ('i', (-1,)) if v is None else formatter(v)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/copy.py", line 213, in f
    errors.raise_from(ValueError, message, exc)
  File "/usr/local/lib/python3.10/site-packages/pgcopy/errors/py3.py", line 9, in raise_from
    raise exccls(message) from exc
ValueError: error formatting value 16cfc1fb-16fc-4888-b6a7-3638698df7ae for column id
ERROR: 1

The value received by pgcopy is the string 16cfc1fb-16fc-4888-b6a7-3638698df7ae, while it expects an instance of uuid.UUID.
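A minimal sketch of the needed coercion (the helper name is hypothetical): pgcopy's uuid formatter reads guid.int, which plain strings don't have, so string primary keys would have to be wrapped in uuid.UUID before the copy.

```python
import uuid

def coerce_uuid(value):
    # pgcopy's uuid formatter accesses .int; wrap strings in uuid.UUID first
    return value if isinstance(value, uuid.UUID) else uuid.UUID(value)

coerce_uuid("16cfc1fb-16fc-4888-b6a7-3638698df7ae")  # uuid.UUID instance
```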

This is happening with this as schema.yml:

tables:
 - users:
    fields:
     - password:
        provider:
          name: mask
          sign: '?'

For context:

$ pip freeze
Faker==9.8.0
parmap==1.5.3
pganonymize==0.6.1
pgcopy==1.5.0
psycopg2==2.9.1
python-dateutil==2.8.2
pytz==2021.3
PyYAML==6.0
six==1.16.0
text-unidecode==1.3
tqdm==4.62.3

$ postgres --version
postgres (PostgreSQL) 11.8

$ psql -d my_db -c "\d users"

                                      Table "public.users"
          Column          |            Type             | Collation | Nullable |         Default         
--------------------------+-----------------------------+-----------+----------+-------------------------
 id                       | uuid                        |           | not null | 
 ...

Please let me know if some more information is needed, or if I missed some info from the documentation 😅

Test completion

Due to the urgency, most parts of the project are untested. A lot of unit tests should therefore be added, mostly for the utils.py module.

Using 'faker.unique.xx' as provider doesn't ensure uniqueness of the values...

... due to the use of raw parallelization.

To prove it, run the following:

import parmap
from pganonymizer.providers import FakeProvider


provider = FakeProvider(name='fake.unique.user_name')


def gen_values(qty=100):
    base_values = [f'v{n}' for n in range(qty)]
    parallel_values = parmap.map(provider.alter_value, base_values)
    serial_values = parmap.map(provider.alter_value, base_values, pm_parallel=False)
    return parallel_values, serial_values


pvals, svals = gen_values()

# verify uniqueness
print(len(set(pvals)), len(set(svals)))

On my machine, running it prints the following:

$ python test.py
16 100

If uniqueness were guaranteed, the output would have been 100 100.

Anonymizing error if there is a JSONB column in a table

I have a strange error:

pganonymizer.exceptions.BadDataFormat: invalid input syntax for type json
DETAIL:  Token "'" is invalid.
CONTEXT:  JSON data, line 1: {'...
COPY source, line 29, column ui_settings: "{'firstTime': True}"

YAML file:

tables:
  - accounts:
      fields:
        - name:
            provider:
              name: fake.name
        - email:
            provider:
              name: fake.email
        - phone:
            provider:
              name: fake.phone_number
        - title:
            provider:
              name: choice
              values:
                - "Mr"
                - "Mrs"
                - "Dr"
                - "Prof"
                - "Ms"

truncate:
  - django_session

ui_settings column values:
{"firstTime": true, "licenseBannerHasBeenShown": true}
{"firstTime": true}
{}

What am I doing wrong?
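The failing value "{'firstTime': True}" is Python's dict repr, not JSON, which suggests the JSONB value is being stringified with str() somewhere in the copy path (an assumption based on the error text). The difference:

```python
import json

row_value = {"firstTime": True}
print(str(row_value))         # {'firstTime': True}  (rejected by PostgreSQL)
print(json.dumps(row_value))  # {"firstTime": true}  (valid JSON)
```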

Subprocess "run" being used on Python2.7

When attempting to use the create_database_dump utility on a legacy Python 2.7 system, the subprocess command fails because run doesn't exist in the subprocess module bundled with Python 2.7.

https://docs.python.org/3.5/library/subprocess.html#older-high-level-api

I believe the equivalent is call instead of run.

Relevant traceback:

Traceback (most recent call last):
  File "/Users/brett/.virtualenvs/iris/bin/pganonymize", line 8, in <module>
    sys.exit(main())
  File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/__main__.py", line 12, in main
    main(args)
  File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/cli.py", line 89, in main
    create_database_dump(args.dump_file, pg_args)
  File "/Users/brett/.virtualenvs/iris/lib/python2.7/site-packages/pganonymizer/utils.py", line 274, in create_database_dump
    subprocess.run(cmd, shell=True)
AttributeError: 'module' object has no attribute 'run'
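A sketch of the suggested swap (the wrapper name is hypothetical): subprocess.call is available on both Python 2.7 and 3.x, whereas subprocess.run was only added in 3.5, and call returns the child's exit status, which is all that is needed here.

```python
import subprocess

def run_dump(cmd):
    # subprocess.call exists on 2.7 and 3.x; subprocess.run is 3.5+
    return subprocess.call(cmd, shell=True)

print(run_dump("exit 0"))  # 0
```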

Commandline argument to list available providers

It would be nice to have a commandline argument that lists all available providers, e.g.:

$ pganonymize --list-providers

choice - Provider that returns a random value from a list of choices.
clear - Provider to set a field value to None.
fake - Provider to generate fake data.
...
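A sketch of how the flag could look with argparse (the PROVIDERS registry and the docstrings here are assumptions, not the project's actual code):

```python
import argparse

# Hypothetical registry: provider name -> first line of its docstring.
PROVIDERS = {
    "choice": "Provider that returns a random value from a list of choices.",
    "clear": "Provider to set a field value to None.",
    "fake": "Provider to generate fake data.",
}

def list_providers():
    # one "name - description" line per provider, sorted by name
    return ["%s - %s" % (name, PROVIDERS[name]) for name in sorted(PROVIDERS)]

def main(argv=None):
    parser = argparse.ArgumentParser(prog="pganonymize")
    parser.add_argument("--list-providers", action="store_true",
                        help="list all available providers and exit")
    args = parser.parse_args(argv)
    if args.list_providers:
        print("\n".join(list_providers()))

main(["--list-providers"])
```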

Commandline argument to create a database dump

After anonymizing a database it would be nice to be able to create a PostgreSQL dump file for further usage. The format could be hard coded for the first version (e.g. bzip2 compressed). Example usage:

$ pganonymize --schema=my_schema.yml \
    --host=localhost \
    --user=user \
    --password=password \
    --dbname=database \
    --dump-file=my_anonymized_database.bz2

During Exclude if the Result is "None" then TypeError is Raised

During the exclusion of rows, if a column value is None, pattern.match(row[column]) raises a TypeError.

https://github.com/rheinwerk-verlag/postgresql-anonymizer/blob/5f6d7b3e1a9f4ae22e843eb1c6d57314a1939936/pganonymizer/utils.py#L101

Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/pganonymize", line 11, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/__main__.py", line 10, in main
    main()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 38, in anonymize_tables
    data, table_columns = build_data(connection, table_name, columns, excludes, total_count, verbose)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 68, in build_data
    if not row_matches_excludes(row, excludes):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pganonymizer/utils.py", line 101, in row_matches_excludes
    if pattern.match(row[column]):
TypeError: expected string or bytes-like object

Ref: myschema.yml

tables:
 - res_partner:
    fields:
     - name:
        provider:
          name: fake.name
     - email:
        provider:
          name: fake.email
    excludes: 
     - email:
        - "info.*@example.com"

Expected behaviour:

  • Ignore such records, since a None value cannot match the pattern anyway
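One possible guard, sketched from the exclude structure visible in the YAML above (the function body is an assumption, not the project's actual code): treat None as a non-match before calling the regex.

```python
import re

def row_matches_excludes(row, excludes):
    # excludes: list of {column: [regex, ...]} mappings, as in the YAML above
    for exclude in excludes or []:
        for column, patterns in exclude.items():
            value = row.get(column)
            if value is None:
                continue  # NULL can never match an exclude pattern
            if any(re.match(p, value) for p in patterns):
                return True
    return False

print(row_matches_excludes({"email": None},
                           [{"email": ["info.*@example.com"]}]))  # False
```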

compatibility with "GENERATED ALWAYS" columns

It would be great to have compatibility with https://www.postgresql.org/docs/current/ddl-generated-columns.html.
At the moment, when I name a generated column in the yml file, I get the following error:

- members:
    primary_key: uuid
    chunk_size: 5000
    fields:
     - name:
        provider:
          name: md5
     - firstname:
        provider:
          name: fake.first_name
     - lastname:
        provider:
          name: fake.last_name

psycopg2.errors.GeneratedAlways: column "name" can only be updated to DEFAULT
DETAIL:  Column "name" is a generated column.

This is, of course, because generated columns can only be updated to DEFAULT.
Maybe a generated provider would be a nice addition?
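One way the tool could become aware of such columns (a sketch; on PostgreSQL 12+, pg_attribute.attgenerated is 's' for stored generated columns and empty otherwise) is to query them up front and then skip them or set them to DEFAULT:

```python
# Hypothetical query: list the generated columns of a table so they can be
# skipped or set to DEFAULT instead of receiving a provider value.
GENERATED_COLUMNS_SQL = """
SELECT a.attname
FROM pg_attribute a
WHERE a.attrelid = %s::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
  AND a.attgenerated <> ''
"""
```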

Python 2.7 tests are failing

Because Python 2.7 has reached end of life, most of the images used for testing have dropped the Python 2.7 interpreter (and so has setting up Python within GitHub Actions; see actions/setup-python#672). This leads to failing tests and breaks the testing chain for the other Python versions. As our company still uses Python 2.7 in production environments, this project still needs to support Python 2.7.
