kensho-technologies / graphql-compiler

Turn complex GraphQL queries into optimized database queries.

License: Apache License 2.0

Python 99.85% Shell 0.15%

compiler database graphql graphql-query orientdb python sql

graphql-compiler's Issues

Add support for using `@fold` inside an `@optional` vertex field or traversal from one

We currently disallow @fold directives either directly within an @optional vertex field or in a traversal from one. We might be able to add support for this with some work, if there is interest in it.

Validate field names to ensure vertex fields are correctly named in the schema

Ensure that the schema satisfies the following two properties:

If a field's name starts with out_ or in_, it must be a vertex field.
All non-root vertex fields must start with either out_ or in_.
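
The two properties above could be checked along these lines. This is a minimal sketch, not the compiler's own validation code: the function name `find_misnamed_fields` and the plain-dict schema representation (type name to field name to an "is vertex field" flag) are assumptions made for illustration.

```python
def find_misnamed_fields(schema_fields, root_type_name):
    """Return (type_name, field_name, reason) tuples for fields that break the naming rules.

    schema_fields is a hypothetical representation of the schema:
    a dict of type name -> dict of field name -> True if it is a vertex field.
    """
    edge_prefixes = ("out_", "in_")
    problems = []
    for type_name, fields in sorted(schema_fields.items()):
        for field_name, is_vertex_field in sorted(fields.items()):
            has_edge_prefix = field_name.startswith(edge_prefixes)
            if has_edge_prefix and not is_vertex_field:
                # Rule 1: an out_/in_ name must belong to a vertex field.
                problems.append((type_name, field_name, "edge-style name on a property field"))
            elif is_vertex_field and not has_edge_prefix and type_name != root_type_name:
                # Rule 2: vertex fields outside the root type need an out_/in_ prefix.
                problems.append((type_name, field_name, "vertex field lacks out_/in_ prefix"))
    return problems


example_schema = {
    "RootSchemaQuery": {"Animal": True},
    "Animal": {
        "name": False,
        "out_Animal_ParentOf": True,
        "in_nickname": False,  # violates rule 1
        "parent": True,        # violates rule 2
    },
}
problems = find_misnamed_fields(example_schema, "RootSchemaQuery")
```

Returning every violation, rather than raising on the first one, lets a schema author fix all the names in one pass.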

Replace SqlBackend with SQLAlchemy Dialect references

Right now the user is supplying the correct backend string, like 'postgresql', to tell the compiler what backend to compile to. Instead, the user should set things up with a sqlalchemy dialect, which is less prone to issues with passing in an incorrect string, and is naturally accessible with most SQLAlchemy setups.

Right now, compiler setup looks like

backend = 'postgresql'
sql_metadata = SqlMetadata(backend, sqlalchemy_metadata)
compile_graphql_to_sql(..., sql_metadata)

Instead, the dialect corresponding to the backend can be passed in as

from sqlalchemy.dialects import postgresql
sql_metadata = SqlMetadata(postgresql, sqlalchemy_metadata)
compile_graphql_to_sql(..., sql_metadata)

Using @tag and then @filter with that tag in the same scope might not work correctly

Also unlikely to work, and should throw good errors: if @tag appears on the exact same property where the @filter is applied.

GraphQLDecimal not implemented

We will need to create a new scalar type to support this.

graphql-python/graphql-core#24

Manage relationship through a directive

I was wondering whether relationships could be managed through a directive, such as gestalt's @relationship directive.
To me, a directive seems more user-friendly, since it does not force any naming convention on field names, but, honestly, I may be missing something.

Add support for pipelined visitor functions

This would allow for a meta-visitor that applies the required visitors in the provided order. This separates the initialization from the steps of the visitors, which is a little more readable.

This could look something like:

visitor_fn_1 = ...
visitor_fn_2 = ...
visitor_fn_3 = ...

visitor_fn = pipeline(
     visitor_fn_1,
     visitor_fn_2,
     visitor_fn_3,
)
block.visit_and_update_expressions(visitor_fn)

as a replacement to

visitor_fn_1 = ...
block.visit_and_update_expressions(visitor_fn_1)
visitor_fn_2 = ...
block.visit_and_update_expressions(visitor_fn_2)
visitor_fn_3 = ...
block.visit_and_update_expressions(visitor_fn_3)
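
One possible shape for the proposed `pipeline` helper is plain function composition. This is only a sketch of the idea in the issue; `pipeline` does not exist in the compiler, and the stand-in visitor functions below operate on integers purely for illustration.

```python
from functools import reduce


def pipeline(*visitor_fns):
    """Combine visitor functions into one that applies them in the given order."""
    def combined_visitor_fn(expression):
        # Feed the result of each visitor into the next one.
        return reduce(lambda current, visitor_fn: visitor_fn(current), visitor_fns, expression)
    return combined_visitor_fn


# Stand-in "visitors" on integers, just to show the ordering.
add_one = lambda value: value + 1
double = lambda value: value * 2
combined = pipeline(add_one, double)
result = combined(3)  # double(add_one(3))
```

Note that a combined visitor applies every function to each expression during a single traversal. That can differ from three separate passes if a later visitor relies on an earlier one having already rewritten the whole tree.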

Clarify that @fold will output empty lists if the vertex field doesn't exist

Add documentation for the SQL backend.

Problem with Query Definition in Schema

Your current test schema looks like this:

type RootSchemaQuery {
        Animal: Animal
        BirthEvent: BirthEvent
        Entity: Entity
        Event: Event
        FeedingEvent: FeedingEvent
        Food: Food
        FoodOrSpecies: FoodOrSpecies
        Location: Location
        Species: Species
        UniquelyIdentifiable: UniquelyIdentifiable
    }

If I understood GraphQL and the queries correctly, you should return an [Animal] instead of a single Animal. But this causes some problems within your compiler.

Add Support for filtering based on size of a list/set

If a class has a field that is a list, I would like to do something like


{
    Animal {
        nicknames @filter(op_name: "length", value: ["$something"])
    }
}

Allow traversing from within a @fold scope

Allow queries like:

{
    Species {
        in_Animal_OfSpecies @fold {
            out_Entity_Related {
                name @output(out_name: "related_list")
            }
        }
    }
}

We currently don't support traversals from inside a @fold scope.

Add Python 3 support

Add support for the "_x_count" meta-field to the Gremlin compiler backend

The Gremlin backend does not currently support the _x_count meta-field, per #158.

Add support for optional edges to the SQL backend.

The rule of thumb for mapping an edge to a JOIN statement is that if the edge is required, an INNER JOIN should be used, and if the edge is optional a LEFT JOIN should be used. This applies to all tables involved in both direct and many-to-many JOINs, with one notable exception.

When an edge is required within an optional scope, the compiler semantics state that if the outer optional edge is present but the inner required edge is not, this result should be excluded. For example, with the GraphQL query:

{
    Animal {
        name @output(out_name: "name")
        out_Animal_ParentOf @optional {
            name @output(out_name: "child_name")
            out_Animal_ParentOf {
                name @output(out_name: "grandchild_name")
            }
        }
    }
}

An animal that has a child (satisfying the first optional ParentOf edge), but where that child has no children (failing to satisfy the second required ParentOf edge) should produce no result. Using nested INNER JOINs here from the outer LEFT JOIN, like

SELECT
    animal.name as name,
    child.name as child_name,
    grandchild.name as grandchild_name
FROM animal
LEFT JOIN (
    animal AS child
    INNER JOIN (
        animal as grandchild
    ) ON child.parentof_id = grandchild.animal_id
) ON animal.parentof_id = child.animal_id

will have a NULL value returned for the grandchild.name property. The LEFT JOIN condition is fulfilled but the INNER JOIN condition is not, which doesn't exclude the result but rather includes it with a NULL value.

To get the correct semantics, the result when the INNER JOIN condition is not fulfilled needs to be filtered out. This is done explicitly by replacing the INNER JOIN with a LEFT JOIN, and then applying the JOIN condition in the WHERE clause to the rows that are non-null from the LEFT JOIN. For this example this looks like:

SELECT
    animal.name as name,
    child.name as child_name,
    grandchild.name as grandchild_name
FROM animal
LEFT JOIN (
    animal AS child
    LEFT JOIN (
        animal as grandchild
    ) ON child.parentof_id = grandchild.animal_id
) ON animal.parentof_id = child.animal_id
WHERE
    child.animal_id IS NULL
    OR
    child.parentof_id = grandchild.animal_id -- reapply JOIN condition in WHERE clause

The null check ensures that the filter is applied only when the outer LEFT JOIN condition is actually satisfied.
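
The intended semantics can be checked against an in-memory SQLite database. This is a sketch under assumptions: the single-table layout follows the SQL in this issue, the sample rows are made up, the root table is aliased as `parent` to keep column references unambiguous, and the nested joins are flattened into two consecutive LEFT JOINs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE animal (animal_id INTEGER PRIMARY KEY, name TEXT, parentof_id INTEGER);
    INSERT INTO animal VALUES
        (1, 'A', 2), (2, 'B', 3), (3, 'C', NULL),
        (4, 'D', NULL), (5, 'E', 6), (6, 'F', NULL);
""")

query = """
SELECT parent.name, child.name, grandchild.name
FROM animal AS parent
LEFT JOIN animal AS child ON parent.parentof_id = child.animal_id
LEFT JOIN animal AS grandchild ON child.parentof_id = grandchild.animal_id
WHERE
    child.animal_id IS NULL                      -- optional edge absent: keep the row
    OR child.parentof_id = grandchild.animal_id  -- optional edge present: require the inner edge
ORDER BY parent.name
"""
rows = conn.execute(query).fetchall()
```

With this data, B and E each have a child with no children of its own, so they are dropped. A is kept with its child and grandchild, and the childless animals C, D, and F are kept with NULL child columns.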

Paging / streaming mechanism for large queries

Once we have query cardinality estimation set up, it should be possible to extend that system to offer a paging / streaming mechanism that avoids overwhelming clients with large result sets all at once.

Folding on an abstract edge class does not work correctly in Gremlin

This has a cause related to #156: the generated Gremlin code assumes that the edge data is stored at a field named to correspond to the edge type in question. However, if the edge class is abstract, this is not the case.

Resolving this issue might be challenging and may require a lot of work, since Gremlin (for the most part) is not aware of the database schema and inheritance structure.

Double-check that @output_source on a scope with a type coercion works

Raise errors if unused @tag directives are found

Having a @tag directive whose value is never used is semantically wrong since the directive could and should simply be removed. This should throw an error, but currently does not.

Add AgensGraph support

Could be interesting to extend support for http://www.agensgraph.com/

Using OrientDB + pyorient doesn't return "null" values -- the key is just missing

Support Huawei Cloud Service (Gremlin)

I am looking for support for Huawei Graph Engine Service (cloud), which supports "pure" Gremlin. I first need to ask whether they customize their own dialect. I think this shouldn't be that hard; I can start with a driver change.

Q: What is the difference between the OrientDB Gremlin "dialect" and pure Gremlin? What should I do to support pure Gremlin?
Q: How do you construct a typical GraphQL schema for a graph? I was even thinking of creating one very large root GraphQL schema covering all labels, properties, etc.

Any help from real-world experience is welcome.

Thank you.

Filtering with "has_edge_degree" on an abstract edge class does not work correctly in MATCH

Assume the following schema:

type Foo {
    name: String
    out_ParentEdgeType: [Foo]
    out_ChildEdgeType: [Foo]
}

where the ParentEdgeType is an abstract superclass of the ChildEdgeType.

The following query then gets incorrectly compiled in MATCH:

{
    Foo {
        name @output(out_name: "name")
        out_ParentEdgeType @optional @filter(op_name: "has_edge_degree", value: ["$degree"]) {
            name
        }
    }
}

The issue is that the has_edge_degree assumes that the edge is stored as a field named out_ParentEdgeType on the Foo vertex. However, this is not always the case -- if the edge is actually of type ChildEdgeType (subclass of ParentEdgeType), it will instead be stored in the out_ChildEdgeType field on the Foo vertex.

A possible resolution would be to switch to using the outE() operator instead, which correctly accounts for inheritance.

NotImplementedError when calling toGremlin in FoldedContextField

Hey guys!

First of all, thanks a lot for this awesome work. I was testing the compiler in combination with Gremlin. The following GraphQL is mentioned in your Readme, but causes a NotImplementedError when trying to generate a Gremlin statement out of it:

{
    Animal {
        name @output(out_name: "name")
        out_Animal_ParentOf @fold {
            _x_count @filter(op_name: ">=", value: ["$min_children"])
                     @output(out_name: "number_of_children")
            name @filter(op_name: "has_substring", value: ["$substr"])
                 @output(out_name: "child_names")
        }
    }
}

Is it a bug, or is it just not implemented?

Many thanks!

Support querying data stored on edges

We currently support only querying data stored on vertices. Edges are used only as a means to get from one vertex to another.

However, data could also in principle be stored on edges. This is something we should consider supporting.

For compiling to MATCH, the following OrientDB issue is a blocker: orientechnologies/orientdb#7802

Add plugin support for custom filtering operations

Support for multiple type coercion in a single scope

Currently, it is not possible to do the following:

{
    Animal {
        name @output(out_name: "animal_name")
        out_Entity_Related {
            ... Species {
                name @output(out_name: "animal_species_name")
            }
            ... Animal {
                name @output(out_name: "related_animal_name")
            }
        }
    }
}

Maybe there's a way to implement it.

Implement @recurse directive for SQL backend

To match the semantics of the GraphQL compiler, recursive common table expressions (CTEs) are required. SQL backends are good at pushing predicates down into subqueries and CTEs, however this does not generally extend to recursive CTEs. This means that it is very easy to write a recursive CTE that will scan an entire table, even if all but a few starting points of that recursion are eventually discarded later.

Using the query

{
    Animal {
        name @output(out_name: "animal_name")
             @filter(op_name: "in_collection", value: ["$names"])
        out_Animal_LivesIn @optional {
            name @output(out_name: "location_name")
        }
        out_Animal_ParentOf @recurse(depth: 2) {
            name @output(out_name: "animal_or_descendant_name")
        }
    }
}

as an example, this is addressed with the following algorithm:

Recursively create the query, treating recursive edges as a black box. For this example, this results in the rough SQL:

SELECT
    animal.name AS animal_name,
    location.name AS location_name
FROM
    animal
LEFT JOIN animal_livesin ON animal_livesin.animal_id = animal.animal_id
LEFT JOIN location ON location.location_id = animal_livesin.livesin_id
WHERE
    animal.name IN :names

Wrap this query as a CTE, and include any link columns in the output. A link column is the column that the recursive clause will later be attached to.

WITH base_cte AS ( -- the actual name of the CTE is an anonymous table name
    SELECT
        animal.name as animal_name,
        location.name as location_name,
        animal.animal_id as link_column -- the actual name of the column is an anonymous column name
    FROM
        animal
    LEFT JOIN animal_livesin ON animal_livesin.animal_id = animal.animal_id
    LEFT JOIN location ON location.location_id = animal_livesin.livesin_id
    WHERE
        animal.name IN :names
)

Construct the recursive clause. Here we only recurse on the columns necessary to JOIN before and after the recursion; output columns are not carried along. The recursion is joined to the CTE of the base query, ensuring that the recursion starts only at the required starting points and no others.
Also worth noting in the recursive clause is the __depth_internal_name column, which keeps track of recursion depth per the compiler's semantics.

WITH RECURSIVE recursive_cte AS (
    -- anchor query, starts with trivial semantics with each animal as its own parent
    SELECT
        base_cte.link_column AS animal_id,
        base_cte.link_column AS parentof_id,
        0 AS __depth_internal_name
    FROM
        base_cte
    UNION ALL
    -- recursive query
    SELECT
        recursive_cte.animal_id,
        animal_parentof.parentof_id,
        -- increment the depth
        recursive_cte.__depth_internal_name + 1 AS __depth_internal_name
    FROM
        animal_parentof
        JOIN recursive_cte ON recursive_cte.parentof_id = animal_parentof.animal_id
    WHERE
        recursive_cte.__depth_internal_name < :depth -- depth from recurse directive
)

JOIN the recursive clause to the recursive table (here animal_parentof) to create output columns, and join back to base cte to carry along tag columns.

WITH recursive_cte_outputs AS (
    SELECT
        animal.name AS animal_or_descendant_name,
        recursive_cte.animal_id AS recursive_link_column -- anonymously aliased column
    FROM
        recursive_cte
        JOIN animal on animal.animal_id = recursive_cte.parentof_id
        JOIN base_cte ON recursive_cte.animal_id = base_cte.link_column
)

Create the final query

SELECT
    base_cte.animal_name,
    base_cte.location_name,
    recursive_cte_outputs.animal_or_descendant_name
FROM
    base_cte
JOIN
    recursive_cte_outputs ON base_cte.link_column = recursive_cte_outputs.recursive_link_column

TypeError caused by BinaryComposition with None sub-expression

Compiling the following legal GraphQL query fails with the TypeError below. It seems that a BinaryComposition object is somehow constructed with a None as one of the sub-expressions.

{
    Animal {
        name @output(out_name: "animal_name")
        uuid @filter(op_name: "between", value: ["$uuid_lower_bound","$uuid_upper_bound"])

        in_Animal_ParentOf @optional
                           @filter(op_name: "has_edge_degree", value: ["$number_of_edges"]) {
            out_Entity_Related {
                ... on Event {
                    name @output(out_name: "related_event")
                }
            }
        }
    }
}

Error:

graphql_compiler/compiler/common.py:50: in compile_graphql_to_match
    schema, graphql_string, type_equivalence_hints)
graphql_compiler/compiler/common.py:94: in _compile_graphql_generic
    type_equivalence_hints=type_equivalence_hints)
graphql_compiler/compiler/ir_lowering_match/__init__.py:121: in lower_ir
    compound_match_query)
graphql_compiler/compiler/ir_lowering_match/optional_traversal.py:563: in lower_context_field_expressions
    match_traversals, current_visitor_fn)
graphql_compiler/compiler/ir_lowering_match/optional_traversal.py:532: in _lower_non_existent_context_field_filters
    new_filter = step.where_block.visit_and_update_expressions(visitor_fn)
graphql_compiler/compiler/blocks.py:174: in visit_and_update_expressions
    new_predicate = self.predicate.visit_and_update(visitor_fn)
graphql_compiler/compiler/expressions.py:676: in visit_and_update
    new_left = self.left.visit_and_update(visitor_fn)
graphql_compiler/compiler/expressions.py:680: in visit_and_update
    return visitor_fn(BinaryComposition(self.operator, new_left, new_right))
graphql_compiler/compiler/expressions.py:660: in __init__
    self.validate()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = BinaryComposition(('&&', None, BinaryComposition(('||', BinaryComposition(('&&', BinaryComposition(('=', Variable(('$n...rentOf',)))), Variable(('$number_of_edges', <graphql.type.definition.GraphQLScalarType object at 0x1056361a8>))))))))))

    def validate(self):
        """Validate that the BinaryComposition is correctly representable."""
        _validate_operator_name(self.operator, BinaryComposition.SUPPORTED_OPERATORS)
    
        if not isinstance(self.left, Expression):
            raise TypeError(u'Expected Expression left, got: {} {}'.format(
>               type(self.left).__name__, self.left))
E           TypeError: Expected Expression left, got: NoneType None

graphql_compiler/compiler/expressions.py:668: TypeError

Thanks to @kaleagore for the report.

Raise errors if there are no @outputs inside a @fold scope

Not sure if this is already the case -- verify and fix if needed.

Add support for required edges to the SQL backend.

Prohibit using @fold inside an @fold scope

The code might not enforce this at the moment. Fix it and add a test.

Mutation support

Are you planning to support graphql mutations?

Snapshot tests' OrientDB schema is out of sync relative to the test GraphQL schema

The current OrientDB schema used for snapshot tests is missing fields and classes that exist in the test GraphQL schema.

Compare:
https://github.com/kensho-technologies/graphql-compiler/blob/c85429d2a56fc5856522429643ce79cce25efda2/graphql_compiler/tests/test_data_tools/schema.sql
vs

graphql-compiler/graphql_compiler/tests/test_helpers.py

Line 86 in c85429d

schema_text = '''

We should bring them back into sync, and add a test to make sure that they don't diverge again.

Improve error message when a meta-field is queried but not built-in or present in the schema

Verify that passed GraphQL schema has all the necessary types and directives

When compiling, we accept a GraphQL schema parameter. We should validate that the schema passed this way includes all the required scalar types and has all the required directives.

Add support to auto-gen GraphQL schema from reflected SQL database tables

This can use a constructed SQLAlchemy MetaData object to construct the GraphQL schema from the table objects in the metadata. These tables themselves can be automatically reflected from the database. See https://docs.sqlalchemy.org/en/latest/core/metadata.html for a little background.

Usage of Variables

In the current examples, all variables are wrapped in double quotes. According to the GraphQL definition of variables, this is wrong. This is not critical, as it works anyway. But if you try to integrate the compiler into 3rd party libraries, this might become an issue.

Instead of using:

{
    Animal {
        name @output(out_name: "animal_name")
        color @filter(op_name: "=", value: ["$animal_color"])
    }
}

You should write:

query($animal_color: String!) {
    Animal {
        name @output(out_name: "animal_name")
        color @filter(op_name: "=", value: [$animal_color])
    }
}

Custom meta field __count is breaking schema parsing in GraphQL.js

It appears that the Python port of GraphQL.js is less strict about enforcing the "no double-underscored fields in the schema" policy than the GraphQL.js library itself. As a result, the schemas generated by the newest version of the compiler cannot be parsed by the original JavaScript GraphQL library.

This is a very unfortunate problem. Sadly, I think the least painful solution would be to rename our __count field to something like _x_count, signifying that it's an extension field via the _x_ prefix. Single-underscored fields are allowed to appear in the schema, so this should address the problem for users relying on non-Python GraphQL libraries.

Unfortunately, this will be a breaking change for the GraphQL compiler, and any queries that rely on __count will have to be changed to use _x_count instead.

cc @jmeulemans @lodrion

8 functions have McCabe complexity > 10

$ flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

./graphql_compiler/compiler/compiler_frontend.py:383:1: C901 '_compile_vertex_ast' is too complex (17)
./graphql_compiler/compiler/ir_lowering_common.py:17:1: C901 'sanity_check_ir_blocks_from_frontend' is too complex (34)
./graphql_compiler/compiler/ir_lowering_common.py:213:1: C901 'optimize_boolean_expression_comparisons' is too complex (13)
./graphql_compiler/compiler/ir_lowering_match.py:284:1: C901 '_translate_equivalent_locations' is too complex (11)
./graphql_compiler/compiler/match_query.py:33:1: C901 '_per_location_tuple_to_step' is too complex (12)
./graphql_compiler/compiler/workarounds/orientdb_eval_scheduling.py:32:1: C901 '_process_filter_block' is too complex (11)
./graphql_compiler/query_formatting/gremlin_formatting.py:83:1: C901 '_safe_gremlin_argument' is too complex (11)
./graphql_compiler/query_formatting/match_formatting.py:69:1: C901 '_safe_match_argument' is too complex (11)

Allow traversals from within a vertex marked @optional

Currently, doing queries like the following is not allowed:

{
    Animal {
        out_Animal_ParentOf @optional {
            out_Animal_FedAt {
                name @output(out_name: "name")
            }
        }
    }
}

This is because we have no way to traverse out of an optional vertex in MATCH, so this issue is a blocker: orientechnologies/orientdb#7803

Add GraphQL schema auto-generation from the existing database schema

Implement required edges for the SQL backend

This is blocked by #212. This feature allows for any number of required edges to be compiled by the SQL backend.

Using from Java

Hi guys,

I am thinking about how to use this compiler from Java. One idea is to create a generator that produces all possible query combinations, outputs each compiled query in some Java-consumable form, and then embeds all of these queries in the Java app.

What do you think? Is there a better way?

Field out_Animal_RelatedTo not existing

There is an example in your Readme using an out_Animal_RelatedTo field, which does not exist according to your schema.

Unable to resolve dependencies for pipenv lock

Off a clean master branch, running pipenv lock throws an error:

Locking [dev-packages] dependencies...

Warning: Your dependencies could not be resolved. You likely have a mismatch in your sub-dependencies.
  You can use $ pipenv install --skip-lock to bypass this mechanism, then run $ pipenv graph to inspect the situation.
  Hint: try $ pipenv lock --pre if it is a pre-release dependency.
Could not find a version that matches pluggy<0.7,>=0.5,>=0.7
Tried: 0.3.0, 0.3.0, 0.3.1, 0.3.1, 0.4.0, 0.4.0, 0.5.0, 0.5.1, 0.5.1, 0.5.2, 0.5.2, 0.6.0, 0.6.0, 0.6.0, 0.7.1, 0.7.1, 0.8.0, 0.8.0
There are incompatible versions in the resolved dependencies.

Allow using multiple @filter directives on the same field

@optional that doesn't exist as the only output generates empty rows

Maybe detect that all outputs are optional, and filter out empty-only rows.

Example affected query:

{
  Animal {
    out_Animal_ParentOf @optional {
      name @output(out_name: "child_name")
    }
  }
}

Animals with no offspring will still return rows, but their data will be empty.
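
As a post-processing illustration of the suggested fix, rows whose outputs are all empty could be dropped on the client side. This is a minimal sketch, assuming result rows arrive as dictionaries; `drop_all_empty_rows` is a hypothetical helper, not part of the compiler.

```python
def drop_all_empty_rows(rows):
    """Keep only rows that have at least one non-None output value."""
    return [
        row for row in rows
        if any(value is not None for value in row.values())
    ]


results = [
    {"child_name": "Bambi"},
    {"child_name": None},  # animal with no offspring
    {},                    # row where the key is missing entirely
]
filtered = drop_all_empty_rows(results)
```

This only makes sense when every output in the query comes from an @optional scope. If any output is required, a row can never be entirely empty, and the check is a no-op.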

Allow "virtual" edges to be defined, and expanded using an AST-based macro system

While normalized data representations are great for data quality and cleanliness, they often get in the way of ease of use, data discoverability, and navigation through the database.

Using the Animals schema in the compiler's tests as an example, it would be much easier to find a given animal's grandparents if an out_Animal_Grandparent edge existed. However, this edge is simply a two-fold traversal of the existing in_Animal_ParentOf edge; adding an out_Animal_Grandparent edge would denormalize the schema and would cause difficulties in maintaining the data.

Instead of adding such an edge to the database, we could define a macro that would be expanded by the GraphQL compiler before query compilation. That way, users can submit a query that relies on the out_Animal_Grandparent edge, and the compiler can use the macro system to rewrite that query into an equivalent query that relies only on existing schema elements.
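
The expansion step could look roughly like the following. This is a toy sketch, not a proposal for the actual macro system: queries are modeled as nested dictionaries (a vertex field maps to its sub-selection, a property field maps to None), directives are ignored, and `expand_macro_edges` and the `macros` mapping are names invented for illustration.

```python
def expand_macro_edges(selection, macros):
    """Rewrite virtual edges in a nested-dict query into chains of real edges.

    selection: dict of field name -> nested dict (vertex field) or None (property field).
    macros: dict of virtual edge name -> list of real edge names to traverse in order.
    """
    expanded = {}
    for field_name, sub_selection in selection.items():
        if sub_selection is not None:
            # Expand any macros used deeper in the query first.
            sub_selection = expand_macro_edges(sub_selection, macros)
        if field_name in macros:
            # Wrap the sub-selection in the real edges, innermost edge first.
            nested = sub_selection
            for edge_name in reversed(macros[field_name]):
                nested = {edge_name: nested}
            expanded.update(nested)
        else:
            expanded[field_name] = sub_selection
    return expanded


macros = {"out_Animal_Grandparent": ["in_Animal_ParentOf", "in_Animal_ParentOf"]}
query = {"Animal": {"name": None, "out_Animal_Grandparent": {"name": None}}}
rewritten = expand_macro_edges(query, macros)
```

One limitation of this toy version is that it overwrites an existing selection on the same real edge within the same scope. A real implementation would need to merge or reject such queries.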

Implement List-valued columns for the PostgreSQL backend

Once required edges are introduced to the SQL backend, the name_or_alias filter will be run against the SQL backend. For this to succeed, there needs to be a SQL backend that supports the list-valued alias field. Postgres is ideal because of its native array type.

This issue requires:

A test that applies the name_or_alias filter to the root.
A modification to the SQL test harness that introduces the alias filter only on test backends that support it (Postgres).
Changing the default test dialect from SQLite to Postgres, so that compiled query tests have a full-featured backend.
