What happens? A cross join to unnest fairly large array fields (20

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Cross join to unnest array never completes,about duckdb/duckdb

Comments (6)

Tmonster commented on June 24, 2024 1

Thank you for filing the report. I took a look at this and I think this is expected behavior, it's just that the result size is extremely large. Here are the logical plans for the two queries you mentioned

original query

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          item_id          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      RIGHT_DELIM_JOIN     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           INNER           │
│delim_index IS NOT DISTINCT├───────────────────────────────────────────┐
│      FROM delim_index     │                                           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                                           │
│         EC: 10364         │                                           │
└─────────────┬─────────────┘                                           │
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│         PROJECTION        │                             │         HASH_JOIN         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             #0            │                             │           INNER           │
│             #1            │                             │delim_index IS NOT DISTINCT├───────────────────────────────────────────┐
│             #2            │                             │      FROM delim_index     │                                           │
│             #3            │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                                           │
│                           │                             │         EC: 10364         │                                           │
└─────────────┬─────────────┘                             └─────────────┬─────────────┘                                           │
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│      STREAMING_WINDOW     │                             │         HASH_JOIN         │                             │         DUMMY_SCAN        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │                           │
│        delim_index        │                             │           INNER           │                             │                           │
│                           │                             │     item_id = item_id     ├──────────────┐              │                           │
│                           │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │              │              │                           │
│                           │                             │           EC: 1           │              │              │                           │
└─────────────┬─────────────┘                             └─────────────┬─────────────┘              │              └───────────────────────────┘
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         HASH_JOIN         │                             │         SEQ_SCAN          ││       INOUT_FUNCTION      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││                           │
│           INNER           │                             │             t3            ││                           │
│          id = id          │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││                           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ├──────────────┐              │          item_id          ││                           │
│        Build Min: 0       │              │              │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││                           │
│      Build Max: 9999      │              │              │         EC: 750000        ││                           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │              │              │                           ││                           │
│         EC: 10364         │              │              │                           ││                           │
└─────────────┬─────────────┘              │              └───────────────────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │                             │         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t2            ││             t1            │                             │   COALESCE(item_ids, [])  │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │        delim_index        │
│             id            ││             id            │                             │          item_ids         │
│          item_ids         ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │                           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││         EC: 10000         │                             │                           │
│         EC: 10000         ││                           │                             │                           │
└───────────────────────────┘└───────────────────────────┘                             └─────────────┬─────────────┘
                                                                                       ┌─────────────┴─────────────┐
                                                                                       │         DELIM_SCAN        │
                                                                                       └───────────────────────────┘

CTE

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          item_id          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         HASH_JOIN         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           INNER           │
│     item_id = item_id     ├──────────────┐
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │              │
│         EC: 11229         │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         HASH_JOIN         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t3            ││           INNER           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││          id = id          │
│          item_id          ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ├──────────────┐
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││        Build Min: 0       │              │
│         EC: 750000        ││      Build Max: 9999      │              │
│                           ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │              │
│                           ││         EC: 10364         │              │
└───────────────────────────┘└─────────────┬─────────────┘              │
                             ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                             │         PROJECTION        ││         SEQ_SCAN          │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │             id            ││             t1            │
                             │          item_id          ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │                           ││             id            │
                             │                           ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │                           ││         EC: 10000         │
                             └─────────────┬─────────────┘└───────────────────────────┘
                             ┌─────────────┴─────────────┐
                             │           UNNEST          │
                             └─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │         SEQ_SCAN          │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │             t2            │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │             id            │
                             │          item_ids         │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │         EC: 10000         │
                             └───────────────────────────┘

Note that the CTE is actually not the same query since there is an extra join condition. This is the main reason the CTE finished so fast (more on that later). An equivalent query with a CTE would looks like the following

explain with id_item as (
      select id, unnest(item_ids) as item_id
      from t2
  )
  SELECT *
  FROM t1
  INNER JOIN t2 on t2.id = t1.id, 
  id_item
  INNER JOIN t3 on t3.item_id = id_item.item_id;

The query plan for this query is

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             id            │
│             id            │
│          item_ids         │
│             id            │
│          item_id          │
│          item_id          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       CROSS_PRODUCT       ├───────────────────────────────────────────┐
└─────────────┬─────────────┘                                           │
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│         HASH_JOIN         │                             │         HASH_JOIN         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           INNER           │                             │           INNER           │
│     item_id = item_id     │                             │          id = id          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ├──────────────┐              │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ├──────────────┐
│         EC: 10834         │              │              │        Build Min: 0       │              │
│                           │              │              │      Build Max: 9999      │              │
│                           │              │              │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │              │
│                           │              │              │         EC: 10364         │              │
└─────────────┬─────────────┘              │              └─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         PROJECTION        ││         SEQ_SCAN          ││         SEQ_SCAN          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t3            ││             id            ││             t2            ││             t1            │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││          item_id          ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          item_id          ││                           ││             id            ││             id            │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││                           ││          item_ids         ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         EC: 750000        ││                           ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││         EC: 10000         │
│                           ││                           ││         EC: 10000         ││                           │
└───────────────────────────┘└─────────────┬─────────────┘└───────────────────────────┘└───────────────────────────┘
                             ┌─────────────┴─────────────┐
                             │           UNNEST          │
                             └─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │         SEQ_SCAN          │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │             t2            │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │             id            │
                             │          item_ids         │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │         EC: 10000         │
                             └───────────────────────────┘

This query also hangs and does not finish (or will get killed by the OS).
I decided to make intermediate tables from the left and right subtrees before the cross produce with the following queries. This way I could estimate the size of the cross product myself.

create table cp_right as (SELECT * FROM t1 INNER JOIN t2 on t2.id = t1.id);
create table cp_left as with id_item as (select id, unnest(item_ids) as item_id from t2) SELECT * from id_item INNER JOIN t3 on t3.item_id = id_item.item_id;
select count(*) from cp_right;
-- 10000
select count(*) from cp_left;
-- 20000000

This means the last cross product will end up producing 200,000,000,000 results. If item_id is just a 32 bit integer, the result set has a size of 800 gigabytes. Using the CTE with the extra join condition, only the size of the left table is returned as cp_right only has distinct item_id values, so only 20,000,000 results, which comes out to 0.08 Gigabytes.

Let me know if this helps. If you still have issues with a use case surrounding this, let me know 👍

from duckdb.

soerenwolfers commented on June 24, 2024 1

@Tmonster The query plans should be equivalent though. They might not be equal, but they should produce equal results; they do in PostgreSQL at least.

See
https://www.db-fiddle.com/f/4xiYbrbGgbCsbGT4dJzHmB/0

where the two queries from the OP produce equal results, and only your CTE produces the "big" result.

In summary, it seems to me the issue reported by the OP is indeed not a performance issue but, worse, a parsing/compilation issue.

from duckdb.

Tmonster commented on June 24, 2024 1

Ok yea, so this is an issue with our Parser most likely. For some reason DuckDB detects a correlated subquery when in reality this is not a correlated subquery.

It will be fixed, but not before v1.0.0 I'm afraid. If you decide to stick to CTE's instead of the subquery solution mentioned in (#11820 (comment)), you can actually run the CTE as a materialized CTE. This way it is a run once, use multiple times

I would leave this issue open though, that way I can remember to come back to it.

from duckdb.

fisher-liquid commented on June 24, 2024

Thanks for the detailed write up! I'll stick to unnesting arrays in CTEs in the future instead of cross joining them.

from duckdb.

Tmonster commented on June 24, 2024

TL;DR Ah yes, technically this isn't a cross product, sorry for the confusion. My explanation for why isn't great, but it has to do with correlated subqueries and the unnest call. I can spend some more time refining it, but my advice is to stick with CTE's in this case, or the following query

SELECT t3.item_id
  FROM t1
  INNER JOIN t2 ON t2.id = t1.id
  INNER JOIN
  (select id, UNNEST(COALESCE(item_ids, [])) item_id from t2) u ON u.id = t1.id
  INNER JOIN t3 on t3.item_id = u.item_id;

Ah ok yea, technically the two should be equivalent, sorry for the confusion. In the query

SELECT t3.item_id
FROM t1
INNER JOIN t2 ON t2.id = t1.id,
UNNEST(COALESCE(t2.item_ids, [])) u(item_id)
INNER JOIN t3 on t3.item_id = u.item_id;

the unnest(coalesce(t2.item_ids,[])) u(item_id) refers to the same read of table t2 from the first inner join, so it isn't really a cross product. For every value in the join between t1 and t2, another column needs to get added that unnests t2.item_ids. While this is easy to explain, our planner/optimizer has no statistics about this unnest, so it duplicates the intermediate table, unnests t2.item_ids in the duplicate and joins it with t3. Then the planner joins the result back to the original join of t1&t2 using an implicit join on the rowid-values of the unnested values of t2.item_ids. This plan is very confusion/bad and is why the query takes so long to execute. The CTE is much easier to optimize for since t2 is read twice, so there is no confusion about when/where to unnest the item_ids in the query plan.

from duckdb.

soerenwolfers commented on June 24, 2024

The problem with "recommend sticking to CTEs" is that (1) it's hard to know in what situations that recommendation applies (2) CTEs come with their own performance problems (e.g. they can't be "run once and use results many times" afaik).

Without having understood the details in your answer, do you agree there is a bug in the current parser/optimiser and do you think that'll be fixed?

Or do you think duckdb will define its interpretation of the OP's query as an accepted deviation from the PostgreSQL dialect?

from duckdb.

Cross join to unnest array never completes about duckdb HOT 6 OPEN

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent