Giter VIP home page Giter VIP logo

Comments (7)

blathers-crl avatar blathers-crl commented on August 26, 2024

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/sql-queries (found keywords: optimizer,vectorized,plan)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

πŸ¦‰ Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

from cockroach.

morphace avatar morphace commented on August 26, 2024

I think the "real" issue might not be primarily the fact that it's doing FULL SCANs but that the rendering of the JSON seems to be performed on the unfiltered result of the joins...

from cockroach.

michae2 avatar michae2 commented on August 26, 2024

Hi @morphace, thanks for opening an issue. It will be difficult to diagnose the problem without the ability to reproduce it. Could you either (a) boil it down to the simplest reproduction steps, including CREATE TABLE and INSERT statements needed to populate the data to reproduce, or (b) open a support ticket with a statement diagnostics bundle?

from cockroach.

morphace avatar morphace commented on August 26, 2024

Hi @michae2,

This is the most simple example I could create:

drop table if exists outer_table, outer_related_table, inner_table, inner_related_table;
create table outer_table
(
    id    string primary key,
    value int
);
create table outer_related_table
(
    id    string primary key,
    outer_sibling_id string,
    value int
);
create table inner_table
(
    id    string primary key,
    parent_id string,
    value int
);
create table inner_related_table
(
    id    string primary key,
    inner_sibling_id string,
    value int
);
create index idx_value on outer_table (value);
create index idx_outer_sibling_id on outer_related_table (outer_sibling_id);
create index idx_parent_id on inner_table (parent_id);
create index idx_inner_sibling_id on inner_related_table (inner_sibling_id);
alter table inner_table
    add constraint fk_inner_outer foreign key (parent_id) references outer_table (id);
alter table outer_related_table
    add constraint fk_inner_outer_sibling foreign key (outer_sibling_id) references outer_table (id);
alter table inner_related_table
    add constraint fk_inner_inner_sibling foreign key (inner_sibling_id) references inner_table (id);
insert into outer_table (id, value)
(select uuid_v4(), random() * 10000 from generate_series(1, 100000));
insert into outer_related_table (id, outer_sibling_id, value)
(select uuid_v4(), id, random() * 10000 from outer_table);
insert into inner_table (id, parent_id, value)
    (select uuid_v4(), id, random() * 10000 from outer_table);
insert into inner_table (id, parent_id, value)
    (select uuid_v4(), id, random() * 10000 from outer_table);
insert into inner_table (id, parent_id, value)
    (select uuid_v4(), id, random() * 10000 from outer_table);
insert into inner_related_table (id, inner_sibling_id, value)
    (select uuid_v4(), id, random() * 10000 from inner_table);
analyze outer_table;
analyze outer_related_table;
analyze inner_table;
analyze inner_related_table;

The query:

explain select
    (select json_agg(json_build_object('val', i.value))
     from inner_table i
     left join inner_related_table irt on irt.inner_sibling_id=i.id
     where parent_id=o.id)
from outer_table o
left join outer_related_table ort on ort.outer_sibling_id=o.id
where o.value > 5000

from cockroach.

morphace avatar morphace commented on August 26, 2024

When you look at the plan below, you'll see that the nested query is doing those full scans and probably rendering the JSON for all rows...?

distribution: local
vectorized: true

β€’ render
β”‚
└── β€’ group (hash)
    β”‚ estimated row count: 50,577"
    β”‚ group by: rownum
    β”‚
    └── β€’ hash join (right outer)
        β”‚ estimated row count: 153,174"
        β”‚ equality: (parent_id) = (id)
        β”‚
        β”œβ”€β”€ β€’ render
        β”‚   β”‚
        β”‚   └── β€’ merge join (left outer)
        β”‚       β”‚ estimated row count: 300,000"
        β”‚       β”‚ equality: (id) = (inner_sibling_id)
        β”‚       β”‚ left cols are key
        β”‚       β”‚
        β”‚       β”œβ”€β”€ β€’ scan
        β”‚       β”‚     estimated row count: 300,000 (100% of the table; stats collected 1 minute ago)"
        β”‚       β”‚     table: inner_table@inner_table_pkey
        β”‚       β”‚     spans: FULL SCAN
        β”‚       β”‚
        β”‚       └── β€’ scan
        β”‚             estimated row count: 300,000 (100% of the table; stats collected 1 minute ago)"
        β”‚             table: inner_related_table@idx_inner_sibling_id
        β”‚             spans: FULL SCAN
        β”‚
        └── β€’ ordinality
            β”‚ estimated row count: 50,577"
            β”‚
            └── β€’ hash join (right outer)
                β”‚ estimated row count: 50,577"
                β”‚ equality: (outer_sibling_id) = (id)
                β”‚ right cols are key
                β”‚
                β”œβ”€β”€ β€’ scan
                β”‚     estimated row count: 100,000 (100% of the table; stats collected 2 minutes ago)"
                β”‚     table: outer_related_table@idx_outer_sibling_id
                β”‚     spans: FULL SCAN
                β”‚
                └── β€’ scan
                      estimated row count: 50,101 (50% of the table; stats collected 2 minutes ago)"
                      table: outer_table@idx_value
                      spans: [/5001 - ]

from cockroach.

michae2 avatar michae2 commented on August 26, 2024

Thank you for the repro, @morphace. I've boiled it down a little further to:

CREATE TABLE ab (a INT PRIMARY KEY, b INT);
CREATE TABLE cd (c INT PRIMARY KEY, d INT);
CREATE TABLE ef (e INT PRIMARY KEY, f INT);

-- able to plan lookup joins
EXPLAIN SELECT (SELECT json_agg(f) FROM cd@{NO_FULL_SCAN} LEFT JOIN ef@{NO_FULL_SCAN} ON e = d WHERE c = b) FROM ab WHERE a > 5;

-- change json_agg to json_object_agg and we cannot produce query plan without FTS
EXPLAIN SELECT (SELECT json_object_agg('val', f) FROM cd@{NO_FULL_SCAN} LEFT JOIN ef@{NO_FULL_SCAN} ON e = d WHERE c = b) FROM ab WHERE a > 5;

-- remove the left join and we can plan a lookup join
EXPLAIN SELECT (SELECT json_object_agg('val', d) FROM cd@{NO_FULL_SCAN} WHERE c = b) FROM ab WHERE a > 5;

from cockroach.

mgartner avatar mgartner commented on August 26, 2024

In order to get a plan with two lookup-joins, we need to reorder the joins to be: (Join (Join ab cd) ef). This enables a plan where we scan ab first, then use the values in column b to lookup rows of cd where c = b, then use the values in column d to lookup rows of ef where e = d.

In the json_object_agg case, the 'val' literal requires a projection expression above the join of cd and ef, so the expression looks like (Join (Project (Join cd ef)) ab). The optimizer won't reorder these two joins because they are not adjacentβ€”due to the Project. Since the joins are not reordered, the lookup joins cannot be explored.

If we change the 'val' literal to a column reference, the projection is omitted, and we get the desired plan:

CREATE TABLE ab (a INT PRIMARY KEY, b INT);
CREATE TABLE cd (c INT PRIMARY KEY, d INT);
CREATE TABLE ef (e INT PRIMARY KEY, f INT, s STRING);

EXPLAIN
SELECT (
  SELECT json_object_agg(s, f)
  FROM cd@{NO_FULL_SCAN}
  LEFT JOIN ef@{NO_FULL_SCAN}
  ON e = d WHERE c = b
)
FROM ab WHERE a > 5;
--                    info
-- ------------------------------------------
--   distribution: local
--   vectorized: true
-- 
--   β€’ render
--   β”‚
--   └── β€’ group (streaming)
--       β”‚ group by: a
--       β”‚ ordered: +a
--       β”‚
--       └── β€’ lookup join (left outer)
--           β”‚ table: ef@ef_pkey
--           β”‚ equality: (d) = (e)
--           β”‚ equality cols are key
--           β”‚
--           └── β€’ lookup join (left outer)
--               β”‚ table: cd@cd_pkey
--               β”‚ equality: (b) = (c)
--               β”‚ equality cols are key
--               β”‚
--               └── β€’ scan
--                     missing stats
--                     table: ab@ab_pkey
--                     spans: [/6 - ]
-- (23 rows)

In the original example, I believe it is other literals, like 'f0', that cause the same problem.

Some potential solutions I can think of are:

  1. Add an exploration rule that pulls Projects that produce a constant value up out of joins when it allows more joins to be adjacent. I'm not sure how generally applicable this is because I think there are likely cases where this would change the semantics of the query plan.
  2. Allow the join reorderer to traverse Projects that produce a constant value. Again, not sure how generally applicable this is.
  3. Could we somehow "curry" the json_object_agg('val', f) function into json_object_agg_curried(f) to eliminate the Project of the literal 'val'? It's an interesting idea, but I don't think it would work in the original example above where the literal values are part of a single argument expression, not the entire argument, e.g., json_build_object('f0', '' || material11_.id, 'f1', '' || material11_.availability_date, ...).

The workaround, for now, is to rewrite the query with the correct join ordering and no subqueries:

CREATE TABLE ab (a INT PRIMARY KEY, b INT);
CREATE TABLE cd (c INT PRIMARY KEY, d INT);
CREATE TABLE ef (e INT PRIMARY KEY, f INT);

EXPLAIN
SELECT json_object_agg('val', f)
FROM ab
LEFT JOIN cd@{NO_FULL_SCAN} ON c = b
LEFT JOIN ef@{NO_FULL_SCAN} ON e = d
WHERE a > 5;
--                    info
-- ------------------------------------------
--   distribution: local
--   vectorized: true
-- 
--   β€’ group (scalar)
--   β”‚
--   └── β€’ render
--       β”‚
--       └── β€’ lookup join (left outer)
--           β”‚ table: ef@ef_pkey
--           β”‚ equality: (d) = (e)
--           β”‚ equality cols are key
--           β”‚
--           └── β€’ lookup join (left outer)
--               β”‚ table: cd@cd_pkey
--               β”‚ equality: (b) = (c)
--               β”‚ equality cols are key
--               β”‚
--               └── β€’ scan
--                     missing stats
--                     table: ab@ab_pkey
--                     spans: [/6 - ]
-- (21 rows)

from cockroach.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.