Giter VIP home page Giter VIP logo

Comments (6)

Tmonster avatar Tmonster commented on June 24, 2024 1

Hi @fisher-liquid,

Thank you for filing the report. I took a look at this and I think this is expected behavior, it's just that the result size is extremely large. Here are the logical plans for the two queries you mentioned

original query

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          item_id          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      RIGHT_DELIM_JOIN     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           INNER           │
│delim_index IS NOT DISTINCT├───────────────────────────────────────────┐
│      FROM delim_index     │                                           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                                           │
│         EC: 10364         │                                           │
└─────────────┬─────────────┘                                           │
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│         PROJECTION        │                             │         HASH_JOIN         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             #0            │                             │           INNER           │
│             #1            │                             │delim_index IS NOT DISTINCT├───────────────────────────────────────────┐
│             #2            │                             │      FROM delim_index     │                                           │
│             #3            │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                                           │
│                           │                             │         EC: 10364         │                                           │
└─────────────┬─────────────┘                             └─────────────┬─────────────┘                                           │
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│      STREAMING_WINDOW     │                             │         HASH_JOIN         │                             │         DUMMY_SCAN        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │                           │
│        delim_index        │                             │           INNER           │                             │                           │
│                           │                             │     item_id = item_id     ├──────────────┐              │                           │
│                           │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │              │              │                           │
│                           │                             │           EC: 1           │              │              │                           │
└─────────────┬─────────────┘                             └─────────────┬─────────────┘              │              └───────────────────────────┘
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         HASH_JOIN         │                             │         SEQ_SCAN          ││       INOUT_FUNCTION      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││                           │
│           INNER           │                             │             t3            ││                           │
│          id = id          │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││                           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ├──────────────┐              │          item_id          ││                           │
│        Build Min: 0       │              │              │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││                           │
│      Build Max: 9999      │              │              │         EC: 750000        ││                           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │              │              │                           ││                           │
│         EC: 10364         │              │              │                           ││                           │
└─────────────┬─────────────┘              │              └───────────────────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │                             │         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t2            ││             t1            │                             │   COALESCE(item_ids, [])  │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │        delim_index        │
│             id            ││             id            │                             │          item_ids         │
│          item_ids         ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │                           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││         EC: 10000         │                             │                           │
│         EC: 10000         ││                           │                             │                           │
└───────────────────────────┘└───────────────────────────┘                             └─────────────┬─────────────┘
                                                                                       ┌─────────────┴─────────────┐
                                                                                       │         DELIM_SCAN        │
                                                                                       └───────────────────────────┘

CTE

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          item_id          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         HASH_JOIN         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           INNER           │
│     item_id = item_id     ├──────────────┐
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │              │
│         EC: 11229         │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         HASH_JOIN         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t3            ││           INNER           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││          id = id          │
│          item_id          ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ├──────────────┐
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││        Build Min: 0       │              │
│         EC: 750000        ││      Build Max: 9999      │              │
│                           ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │              │
│                           ││         EC: 10364         │              │
└───────────────────────────┘└─────────────┬─────────────┘              │
                             ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                             │         PROJECTION        ││         SEQ_SCAN          │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │             id            ││             t1            │
                             │          item_id          ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │                           ││             id            │
                             │                           ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │                           ││         EC: 10000         │
                             └─────────────┬─────────────┘└───────────────────────────┘
                             ┌─────────────┴─────────────┐
                             │           UNNEST          │
                             └─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │         SEQ_SCAN          │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │             t2            │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │             id            │
                             │          item_ids         │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │         EC: 10000         │
                             └───────────────────────────┘

Note that the CTE is actually not the same query since there is an extra join condition. This is the main reason the CTE finished so fast (more on that later). An equivalent query with a CTE would looks like the following

explain with id_item as (
      select id, unnest(item_ids) as item_id
      from t2
  )
  SELECT *
  FROM t1
  INNER JOIN t2 on t2.id = t1.id, 
  id_item
  INNER JOIN t3 on t3.item_id = id_item.item_id;

The query plan for this query is

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             id            │
│             id            │
│          item_ids         │
│             id            │
│          item_id          │
│          item_id          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       CROSS_PRODUCT       ├───────────────────────────────────────────┐
└─────────────┬─────────────┘                                           │
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│         HASH_JOIN         │                             │         HASH_JOIN         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           INNER           │                             │           INNER           │
│     item_id = item_id     │                             │          id = id          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ├──────────────┐              │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ├──────────────┐
│         EC: 10834         │              │              │        Build Min: 0       │              │
│                           │              │              │      Build Max: 9999      │              │
│                           │              │              │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │              │
│                           │              │              │         EC: 10364         │              │
└─────────────┬─────────────┘              │              └─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         PROJECTION        ││         SEQ_SCAN          ││         SEQ_SCAN          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t3            ││             id            ││             t2            ││             t1            │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││          item_id          ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          item_id          ││                           ││             id            ││             id            │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││                           ││          item_ids         ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         EC: 750000        ││                           ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││         EC: 10000         │
│                           ││                           ││         EC: 10000         ││                           │
└───────────────────────────┘└─────────────┬─────────────┘└───────────────────────────┘└───────────────────────────┘
                             ┌─────────────┴─────────────┐
                             │           UNNEST          │
                             └─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │         SEQ_SCAN          │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │             t2            │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │             id            │
                             │          item_ids         │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │         EC: 10000         │
                             └───────────────────────────┘

This query also hangs and does not finish (or will get killed by the OS).
I decided to make intermediate tables from the left and right subtrees before the cross produce with the following queries. This way I could estimate the size of the cross product myself.

create table cp_right as (SELECT * FROM t1 INNER JOIN t2 on t2.id = t1.id);
create table cp_left as with id_item as (select id, unnest(item_ids) as item_id from t2) SELECT * from id_item INNER JOIN t3 on t3.item_id = id_item.item_id;
select count(*) from cp_right;
-- 10000
select count(*) from cp_left;
-- 20000000

This means the last cross product will end up producing 200,000,000,000 results. If item_id is just a 32 bit integer, the result set has a size of 800 gigabytes. Using the CTE with the extra join condition, only the size of the left table is returned as cp_right only has distinct item_id values, so only 20,000,000 results, which comes out to 0.08 Gigabytes.

Let me know if this helps. If you still have issues with a use case surrounding this, let me know 👍

from duckdb.

soerenwolfers avatar soerenwolfers commented on June 24, 2024 1

@Tmonster The query plans should be equivalent though. They might not be equal, but they should produce equal results; they do in PostgreSQL at least.

See
https://www.db-fiddle.com/f/4xiYbrbGgbCsbGT4dJzHmB/0

where the two queries from the OP produce equal results, and only your CTE produces the "big" result.

In summary, it seems to me the issue reported by the OP is indeed not a performance issue but, worse, a parsing/compilation issue.

from duckdb.

Tmonster avatar Tmonster commented on June 24, 2024 1

Ok yea, so this is an issue with our Parser most likely. For some reason DuckDB detects a correlated subquery when in reality this is not a correlated subquery.

It will be fixed, but not before v1.0.0 I'm afraid. If you decide to stick to CTE's instead of the subquery solution mentioned in (#11820 (comment)), you can actually run the CTE as a materialized CTE. This way it is a run once, use multiple times

I would leave this issue open though, that way I can remember to come back to it.

from duckdb.

fisher-liquid avatar fisher-liquid commented on June 24, 2024

Thanks for the detailed write up! I'll stick to unnesting arrays in CTEs in the future instead of cross joining them.

from duckdb.

Tmonster avatar Tmonster commented on June 24, 2024

TL;DR Ah yes, technically this isn't a cross product, sorry for the confusion. My explanation for why isn't great, but it has to do with correlated subqueries and the unnest call. I can spend some more time refining it, but my advice is to stick with CTE's in this case, or the following query

SELECT t3.item_id
  FROM t1
  INNER JOIN t2 ON t2.id = t1.id
  INNER JOIN
  (select id, UNNEST(COALESCE(item_ids, [])) item_id from t2) u ON u.id = t1.id
  INNER JOIN t3 on t3.item_id = u.item_id;

Ah ok yea, technically the two should be equivalent, sorry for the confusion. In the query

SELECT t3.item_id
FROM t1
INNER JOIN t2 ON t2.id = t1.id,
UNNEST(COALESCE(t2.item_ids, [])) u(item_id)
INNER JOIN t3 on t3.item_id = u.item_id;

the unnest(coalesce(t2.item_ids,[])) u(item_id) refers to the same read of table t2 from the first inner join, so it isn't really a cross product. For every value in the join between t1 and t2, another column needs to get added that unnests t2.item_ids. While this is easy to explain, our planner/optimizer has no statistics about this unnest, so it duplicates the intermediate table, unnests t2.item_ids in the duplicate and joins it with t3. Then the planner joins the result back to the original join of t1&t2 using an implicit join on the rowid-values of the unnested values of t2.item_ids. This plan is very confusion/bad and is why the query takes so long to execute. The CTE is much easier to optimize for since t2 is read twice, so there is no confusion about when/where to unnest the item_ids in the query plan.

from duckdb.

soerenwolfers avatar soerenwolfers commented on June 24, 2024

The problem with "recommend sticking to CTEs" is that (1) it's hard to know in what situations that recommendation applies (2) CTEs come with their own performance problems (e.g. they can't be "run once and use results many times" afaik).

Without having understood the details in your answer, do you agree there is a bug in the current parser/optimiser and do you think that'll be fixed?

Or do you think duckdb will define its interpretation of the OP's query as an accepted deviation from the PostgreSQL dialect?

from duckdb.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.