Giter VIP home page Giter VIP logo

Comments (12)

heatherbooker avatar heatherbooker commented on July 21, 2024

Update: if I move the two all_tables lines to after the source_dbc exists, it gets mad if any field is missing from the config file (which it didn't do with the direct subset script). By gets mad I mean

Traceback (most recent call last):
  File "iterative_subset.py", line 66, in <module>
    lower_limit, lower_limit_norm, upper_limit, upper_limit_norm = compute_fast_limits()
  File "iterative_subset.py", line 40, in compute_fast_limits
    desired_result = config_reader.get_target_percent()
  File "/Users/heather.booker/dev/condenser/config_reader.py", line 27, in get_target_percent
    return _config['desired_result']['percent']
KeyError: 'desired_result'

And similar with excluded_tables.

Maybe this is intended and I would have experienced it later if the direct_subset script had gotten further? I'm not sure. ><

from condenser.

acolombi avatar acolombi commented on July 21, 2024

I just pushed a bunch of updates. Maybe give it another whirl after pulling the latest? Of note: We deprecated iterative_subset.py. I think our best bet is to debug your issue with direct_subset.py.

So what's the problem you're seeing with direct_subset.py? If you can send a copy of your schema, that might help us diagnose more quickly. The easiest way to do that is with pg_dump, https://www.postgresql.org/docs/9.6/app-pgdump.html. The command you would use looks like,

$ pg_dump -h <host> -U <user> -p <port> -d <database> --schema-only -f <output_file>.sql

If you don't want to paste the schema here I can give you an email address to send it too. Thanks.

from condenser.

heatherbooker avatar heatherbooker commented on July 21, 2024

Ah ok cool. I think the reason it wasn't working for me was that I failed to read the "Known Issues" section with

Only works with bigint, non-compound primary and foreign keys.

:)

Any plans in the future for that to change?

from condenser.

acolombi avatar acolombi commented on July 21, 2024

Is it the compound part of the int part that's a challenge for you? (or, I guess, both?)

from condenser.

heatherbooker avatar heatherbooker commented on July 21, 2024

Probably both but definitely at least the compound part!

from condenser.

heatherbooker avatar heatherbooker commented on July 21, 2024

But should I be able to run it ignoring the tables involving those keys? Like

# config.json
{
    "passthrough_threshold": 50,
    "excluded_tables": [
        {"schema": "public","tables": [ "nos", "nopes", "no_nopes"] }
    ],
    "passthrough_tables": [
        {"schema": "public","tables": [ "nos", "nopes", "no_nopes"] }
    ],
    "desired_result": { "percent": 5, "schema": "public", "table": "potatoes" },
 }

With the above, I expect it to ignore nos nopes and no_nopes, but it still chokes on them with Exception: Creating tables failed., and in the create_error.txt:

psql:/me/condenser/SQL/dump_create.sql:5788:
ERROR: column "nopes.none_of_this" must appear in the GROUP BY clause
       or be used in an aggregate function
LINE 3:     nopes.none_of_this,

from condenser.

acolombi avatar acolombi commented on July 21, 2024

Yes, excluded_tables should ignore the tables, if possible. It might be the case that the tables with compound keys are required, for example if you subset on table user, which has a FK to address, and address using a compound PK. In that scenario you can't take any of the user table without also subsetting address, and excluding the table would fail. (Though one way you can handle that is to specify that relationship in the dependency_breaks, e.g. [{"parent": "public.user", "child": "public.addess"}].

I just pushed some changes that might address the issue you're seeing. It's hard to know without a better idea of what's in your schema. Any chance you can share it?

Otherwise my guess is that it has something to do with using a VIEW that contains a GROUP BY in your schema. This was an issue we addressed in our commercial offering so I just ported those changes over.

from condenser.

acolombi avatar acolombi commented on July 21, 2024

Hi @heatherbooker , have you had a chance to check out the new version? Thanks.

from condenser.

heatherbooker avatar heatherbooker commented on July 21, 2024

Hi @acolombi , sorry for the hold up. Unfortunately I can't share the schema, sorry! If I try the new version, it seems that the config format changed to accept target but the readme and example.config.json did not, just a heads up!

If I combine the schema and table keys in my config file with target, it gets farther now! It hits

Beginning subsetting with these direct targets:
...
Beginning greedy downstream subsetting with these tables:
...

And then gets trapped on

Traceback (most recent call last):
  File "/me/condenser/subset.py", line 121, in __subset_greedily
    database_helper.run_query('INSERT INTO "{}"."{}" {}'.format(schema_name(target), table_name(target), query), destination_conn)
  File "/me/condenser/database_helper.py", line 87, in run_query
    cur.execute(query)
psycopg2.ProgrammingError: operator does not exist: integer = character varying
LINE 1: ...long_table_name"."foreign_key_type_integer" IN (SELECT...
                                                             ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "direct_subset.py", line 30, in <module>
    s.run_middle_out()
  File "/me/condenser/subset.py", line 81, in run_middle_out
    self.__subset_greedily(t, processed_tables, relationships)
  File "/me/condenser/subset.py", line 126, in __subset_greedily
    database_helper.run_query('DROP TABLE IF EXISTS "{}"."{}"'.format(self.temp_schema, temp_target_name), destination_conn)
  File "/me/condenser/database_helper.py", line 87, in run_query
    cur.execute(query)

In any case, it's very possible that there are some dependencies that I should be breaking but am not, but I'm probably not going to end up going through and finding them. I'm not sure if the above is helpful to you but I really appreciate the help and fixes you've been providing through this!

from condenser.

acolombi avatar acolombi commented on July 21, 2024

Ah yes, thanks for the feedback. The README and example have been updated.

It should give a pretty clear error message when there are unbroken circular dependencies. It'll say something like:

$ python direct_subset.py 
Traceback (most recent call last):
  File "direct_subset.py", line 30, in <module>
    s.run_middle_out()
  File "./condenser/subset.py", line 59, in run_middle_out
    order = get_topological_order_by_tables(relationships, connected_tables)
  File "./condenser/topo_orderer.py", line 6, in get_topological_order_by_tables
    return list(toposort(topsort_input))
  File "~/.pyenv/versions/p35/lib/python3.5/site-packages/toposort.py", line 81, in toposort
    raise CircularDependencyError(data)
toposort.CircularDependencyError: Circular dependencies exist among these items: {'public.events':{'public.users'}, 'public.ignore_me':{'public.users'}, 'public.places':{'sub.regions'}, 'public.users':{'sub.regions'}, 'sub.regions':{'public.users'}}

So I think the error is not a circular dependency you're forgetting to break...

Honestly tho I don't know what it is. Tracing through the code I'm not sure what would cause an error like the one you're experiencing. Is there any chance you can share just the schema surrounding the table and foreign key identified in the exception you got?

Another option is you could try our commercial product, which also features subsetting (among other abilities, including data masking and data synthesis), but has a more robust implementation. (This Python code is mostly a playground for trying ideas out.)

Thanks.

from condenser.

heatherbooker avatar heatherbooker commented on July 21, 2024

Ah yeah, I was thinking about the dependencies in terms of excluding tables with compound keys in terms of what you mentioned earlier

It might be the case that the tables with compound keys are required, for example if you subset on table user, which has a FK to address, and address using a compound PK. In that scenario you can't take any of the user table without also subsetting address, and excluding the table would fail. (Though one way you can handle that is to specify that relationship in the dependency_breaks, e.g. [{"parent": "public.user", "child": "public.addess"}].

Actually now that I think about it, for this line in the output, the first part isn't actually a table name - it seems to be something prepended to it, ie the table is called long_table_name but it shows up as ...mp_long_table_name:

psycopg2.ProgrammingError: operator does not exist: integer = character varying
LINE 1: ...mp_long_table_name"."foreign_key_type_integer" IN (SELECT...

As for the schema, long_table_name does have an index (not the foreign_key_type_integer shown in the error output) UNIQUE CONSTRAINT, btree (foreign_key_type_integer, other_fk_to_different_table_also_type_integer) - so maybe that has to do with what it's catching on?

from condenser.

heatherbooker avatar heatherbooker commented on July 21, 2024

Anyway I'm gonna close since I don't want either of us to sink too much time into this, but thanks so much for your help!!

from condenser.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.