What happens? I created a timestamp that (ordered timestamp,trace_

You could have turned this into a MWE like this: <div class="highlight highlight-s

You could have turned this into a MWE like this: <div class="highligh

Performance issues using LEFT JOIN about duckdb HOT 4 OPEN

ASaltedFishBoy commented on September 22, 2024

Performance issues using LEFT JOIN

from duckdb.

Comments (4)

soerenwolfers commented on September 22, 2024 1

"To reproduce" means you are supposed to provide concrete code that reproduces the observed problem, at least on your hardware and software setup. In particular, you have to add the definition and contents of the table in your snippet. If you find it impossible to reproduce the problem on a toy table with code-generated or manually inserted fake values, you should upload the contents of your table, e.g. as parquet file, and include the code that reads the parquet file in your snippet.

from duckdb.

ASaltedFishBoy commented on September 22, 2024

"To reproduce" means you are supposed to provide concrete code that reproduces the observed problem, at least on your hardware and software setup. In particular, you have to add the definition and contents of the table in your snippet. If you find it impossible to reproduce the problem on a toy table with a few manually inserted values, you should upload the contents of your table, e.g. as parquet file, and include the code that reads the parquet file in your snippet.

This is my table construction statement,the part that does not contain data insertion of CAN_TEST.I have nothing more can offer.

CREATE SEQUENCE id_sequence START 1
CREATE TABLE CAN_TEST (trace_id BIGINT PRIMARY KEY DEFAULT nextval('id_sequence'),timestamp BIGINT, others...)
CREATE UNIQUE INDEX CAN_TEST_trace_id_IDX ON main.CAN_TEST  (trace_id)

CREATE TABLE CAN_TEST_timestamp (trace_id BIGINT PRIMARY KEY ,timestamp BIGINT)
INSERT INTO CAN_TEST_timestamp (SELECT trace_id,timestamp FROM main.CAN_TEST ORDER BY "timestamp")

Then there are the two statements in issue.

from duckdb.

soerenwolfers commented on September 22, 2024

You could have turned this into a MWE like this:

CREATE OR REPLACE SEQUENCE id_sequence START 1;

CREATE OR REPLACE TABLE CAN_TEST (trace_id BIGINT PRIMARY KEY DEFAULT nextval('id_sequence'), timestamp BIGINT, other1 REAL, other2 TIMESTAMP);
CREATE UNIQUE INDEX CAN_TEST_trace_id_IDX ON main.CAN_TEST  (trace_id);

INSERT INTO CAN_TEST (SELECT 
    unnest(ARRAY(SELECT * FROM range(10_000_000))),
    unnest(ARRAY(SELECT random() * 100 FROM range(10_000_000))),
    unnest(ARRAY(SELECT random() FROM range(10_000_000))),
    unnest(ARRAY(SELECT '2020-01-01'::DATE + to_seconds(range) FROM range(10_000_000)))
);

CREATE OR REPLACE TABLE CAN_TEST_timestamp (trace_id BIGINT PRIMARY KEY, timestamp BIGINT);
INSERT INTO CAN_TEST_timestamp (SELECT trace_id,timestamp FROM main.CAN_TEST ORDER BY "timestamp");

SELECT * FROM (SELECT * FROM (SELECT * FROM main.CAN_TEST_timestamp ORDER BY "timestamp") LIMIT 4000 OFFSET 8000000) AS timeOrder INNER JOIN main.CAN_TEST ON main.CAN_TEST.trace_id == timeOrder.trace_id ORDER BY (timeOrder."timestamp",timeOrder.trace_id);

449 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

SELECT * FROM (SELECT * FROM (SELECT * FROM main.CAN_TEST_timestamp ORDER BY "timestamp") LIMIT 4000 OFFSET 8000000) AS timeOrder LEFT JOIN main.CAN_TEST ON main.CAN_TEST.trace_id == timeOrder.trace_id ORDER BY (timeOrder."timestamp",timeOrder.trace_id);

603 ms ± 32.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I do see a performance difference, but nowhere near the size you observe. Maybe it's the "other" columns? Maybe it's the number of rows? Maybe it's the datatypes of the other columns?

from duckdb.

ASaltedFishBoy commented on September 22, 2024

You could have turned this into a MWE like this:

CREATE OR REPLACE SEQUENCE id_sequence START 1;

CREATE OR REPLACE TABLE CAN_TEST (trace_id BIGINT PRIMARY KEY DEFAULT nextval('id_sequence'), timestamp BIGINT, other1 REAL, other2 TIMESTAMP);
CREATE UNIQUE INDEX CAN_TEST_trace_id_IDX ON main.CAN_TEST  (trace_id);

INSERT INTO CAN_TEST (SELECT 
    unnest(ARRAY(SELECT * FROM range(10_000_000))),
    unnest(ARRAY(SELECT random() * 100 FROM range(10_000_000))),
    unnest(ARRAY(SELECT random() FROM range(10_000_000))),
    unnest(ARRAY(SELECT '2020-01-01'::DATE + to_seconds(range) FROM range(10_000_000)))
);

CREATE OR REPLACE TABLE CAN_TEST_timestamp (trace_id BIGINT PRIMARY KEY, timestamp BIGINT);
INSERT INTO CAN_TEST_timestamp (SELECT trace_id,timestamp FROM main.CAN_TEST ORDER BY "timestamp");

SELECT * FROM (SELECT * FROM (SELECT * FROM main.CAN_TEST_timestamp ORDER BY "timestamp") LIMIT 4000 OFFSET 8000000) AS timeOrder INNER JOIN main.CAN_TEST ON main.CAN_TEST.trace_id == timeOrder.trace_id ORDER BY (timeOrder."timestamp",timeOrder.trace_id);

449 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

SELECT * FROM (SELECT * FROM (SELECT * FROM main.CAN_TEST_timestamp ORDER BY "timestamp") LIMIT 4000 OFFSET 8000000) AS timeOrder LEFT JOIN main.CAN_TEST ON main.CAN_TEST.trace_id == timeOrder.trace_id ORDER BY (timeOrder."timestamp",timeOrder.trace_id);

603 ms ± 32.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I do see a performance difference, but nowhere near the size you observe. Maybe it's the "other" columns? Maybe it's the number of rows? Maybe it's the datatypes of the other columns?

There are roughly 30 columns of data here, of different types(int64,string or bytes), but the data itself is not large (1 row of total bytes is roughly 200bytes), and the other differences seem to be absent.
I am so sorry that I cannot provide a specific table definition, which involves confidentiality.

from duckdb.

Performance issues using LEFT JOIN about duckdb HOT 4 OPEN

Comments (4)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent