Comments (4)
"To reproduce" means you are supposed to provide concrete code that reproduces the observed problem, at least on your hardware and software setup. In particular, you have to add the definition and contents of the table in your snippet. If you find it impossible to reproduce the problem on a toy table with code-generated or manually inserted fake values, you should upload the contents of your table, e.g. as parquet file, and include the code that reads the parquet file in your snippet.
from duckdb.
"To reproduce" means you are supposed to provide concrete code that reproduces the observed problem, at least on your hardware and software setup. In particular, you have to add the definition and contents of the table in your snippet. If you find it impossible to reproduce the problem on a toy table with a few manually inserted values, you should upload the contents of your table, e.g. as parquet file, and include the code that reads the parquet file in your snippet.
This is my table construction statement,the part that does not contain data insertion of CAN_TEST.I have nothing more can offer.
CREATE SEQUENCE id_sequence START 1
CREATE TABLE CAN_TEST (trace_id BIGINT PRIMARY KEY DEFAULT nextval('id_sequence'),timestamp BIGINT, others...)
CREATE UNIQUE INDEX CAN_TEST_trace_id_IDX ON main.CAN_TEST (trace_id)
CREATE TABLE CAN_TEST_timestamp (trace_id BIGINT PRIMARY KEY ,timestamp BIGINT)
INSERT INTO CAN_TEST_timestamp (SELECT trace_id,timestamp FROM main.CAN_TEST ORDER BY "timestamp")
Then there are the two statements in issue.
from duckdb.
You could have turned this into a MWE like this:
CREATE OR REPLACE SEQUENCE id_sequence START 1;
CREATE OR REPLACE TABLE CAN_TEST (trace_id BIGINT PRIMARY KEY DEFAULT nextval('id_sequence'), timestamp BIGINT, other1 REAL, other2 TIMESTAMP);
CREATE UNIQUE INDEX CAN_TEST_trace_id_IDX ON main.CAN_TEST (trace_id);
INSERT INTO CAN_TEST (SELECT
unnest(ARRAY(SELECT * FROM range(10_000_000))),
unnest(ARRAY(SELECT random() * 100 FROM range(10_000_000))),
unnest(ARRAY(SELECT random() FROM range(10_000_000))),
unnest(ARRAY(SELECT '2020-01-01'::DATE + to_seconds(range) FROM range(10_000_000)))
);
CREATE OR REPLACE TABLE CAN_TEST_timestamp (trace_id BIGINT PRIMARY KEY, timestamp BIGINT);
INSERT INTO CAN_TEST_timestamp (SELECT trace_id,timestamp FROM main.CAN_TEST ORDER BY "timestamp");
SELECT * FROM (SELECT * FROM (SELECT * FROM main.CAN_TEST_timestamp ORDER BY "timestamp") LIMIT 4000 OFFSET 8000000) AS timeOrder INNER JOIN main.CAN_TEST ON main.CAN_TEST.trace_id == timeOrder.trace_id ORDER BY (timeOrder."timestamp",timeOrder.trace_id);
449 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
SELECT * FROM (SELECT * FROM (SELECT * FROM main.CAN_TEST_timestamp ORDER BY "timestamp") LIMIT 4000 OFFSET 8000000) AS timeOrder LEFT JOIN main.CAN_TEST ON main.CAN_TEST.trace_id == timeOrder.trace_id ORDER BY (timeOrder."timestamp",timeOrder.trace_id);
603 ms ± 32.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I do see a performance difference, but nowhere near the size you observe. Maybe it's the "other" columns? Maybe it's the number of rows? Maybe it's the datatypes of the other columns?
from duckdb.
You could have turned this into a MWE like this:
CREATE OR REPLACE SEQUENCE id_sequence START 1; CREATE OR REPLACE TABLE CAN_TEST (trace_id BIGINT PRIMARY KEY DEFAULT nextval('id_sequence'), timestamp BIGINT, other1 REAL, other2 TIMESTAMP); CREATE UNIQUE INDEX CAN_TEST_trace_id_IDX ON main.CAN_TEST (trace_id); INSERT INTO CAN_TEST (SELECT unnest(ARRAY(SELECT * FROM range(10_000_000))), unnest(ARRAY(SELECT random() * 100 FROM range(10_000_000))), unnest(ARRAY(SELECT random() FROM range(10_000_000))), unnest(ARRAY(SELECT '2020-01-01'::DATE + to_seconds(range) FROM range(10_000_000))) ); CREATE OR REPLACE TABLE CAN_TEST_timestamp (trace_id BIGINT PRIMARY KEY, timestamp BIGINT); INSERT INTO CAN_TEST_timestamp (SELECT trace_id,timestamp FROM main.CAN_TEST ORDER BY "timestamp");SELECT * FROM (SELECT * FROM (SELECT * FROM main.CAN_TEST_timestamp ORDER BY "timestamp") LIMIT 4000 OFFSET 8000000) AS timeOrder INNER JOIN main.CAN_TEST ON main.CAN_TEST.trace_id == timeOrder.trace_id ORDER BY (timeOrder."timestamp",timeOrder.trace_id);449 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
SELECT * FROM (SELECT * FROM (SELECT * FROM main.CAN_TEST_timestamp ORDER BY "timestamp") LIMIT 4000 OFFSET 8000000) AS timeOrder LEFT JOIN main.CAN_TEST ON main.CAN_TEST.trace_id == timeOrder.trace_id ORDER BY (timeOrder."timestamp",timeOrder.trace_id);603 ms ± 32.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I do see a performance difference, but nowhere near the size you observe. Maybe it's the "other" columns? Maybe it's the number of rows? Maybe it's the datatypes of the other columns?
There are roughly 30 columns of data here, of different types(int64,string or bytes), but the data itself is not large (1 row of total bytes is roughly 200bytes), and the other differences seem to be absent.
I am so sorry that I cannot provide a specific table definition, which involves confidentiality.
from duckdb.
Related Issues (20)
- Upsert complains about lack of constrains despite the fact they are present HOT 4
- 'Duplicate key' error when update column with list data type and table has a primary key HOT 3
- Compilation warnings in re2 with `-Wpedantic` HOT 10
- Running simple `SELECT` over S3 attached DuckDB db segfaults HOT 4
- Wrong comparison for timestamp depending on if it's in a subquery or not HOT 2
- PIVOT does not support geomean as aggregation function HOT 2
- error when inserting duped values into primary column with many values HOT 1
- `CAST( 'yes' AS BOOLEAN)` fails HOT 3
- No facility to parse formatted Number Strings to Numeric/Double/Decimal HOT 2
- MemoryError: std::bad_alloc HOT 7
- `Round()` and `Round_Even()` return different type HOT 1
- ORDER BY with nested list type gives wrong result
- Support brotli compression for Parquet files HOT 1
- Convert Column Type Failed HOT 3
- `REGEXP_REPLACE` replaces only the first match without any option for all matches HOT 2
- Shadowing of variable names in Python HOT 2
- Querying empty Polars dataframe produces internal error
- CURRENT_TIMESTAMP + INTERVAL 1 DAY fails HOT 5
- French stemmer doesn't work using FTS? HOT 3
- Boolean implicitly cast to integer in UNION clause HOT 5
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from duckdb.