
Comments (5)

vnnw commented on August 16, 2024

How did you insert the data into the cstore tables? If the data were inserted sorted on the columns you later use in WHERE/GROUP BY clauses, I think cstore will shine.
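For example, a pre-sorted load could look roughly like the following (the table and column names are only illustrative, and depending on the cstore_fdw version you may need to load through COPY rather than INSERT):

-- export the rows sorted on the column most often used in WHERE/GROUP BY,
-- then load the sorted file into the cstore foreign table
\copy (select * from customers_heap order by zip) to '/tmp/customers_sorted.csv' with csv
copy customers_cstore from '/tmp/customers_sorted.csv' with csv;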

from cstore_fdw.

mtuncer commented on August 16, 2024

cstore operates as a foreign data wrapper for PostgreSQL. Data is first fetched from cstore, then any grouping and counting is performed.

In the first query (SELECT zip, COUNT(*) FROM _cstore GROUP BY 1), all values belonging to the zip column are fetched from cstore, and grouping/counting is performed by PostgreSQL afterwards.

The 2nd query is similar: again all state and phone1 values are fetched, then grouping and distinct counting are performed by PostgreSQL.

The 3rd query is different in the sense that it has filtering; cstore runs through the field values and prunes out data blocks that do not have a match.

So the general reasons for these queries taking long are (1) having to return all values of the selected columns, and (2) group by/distinct processing over 16M rows.

I would like to investigate this issue further. Did you have a chance to run those queries on regular PostgreSQL tables for comparison purposes?
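For example, running the same aggregate through EXPLAIN ANALYZE against both the cstore foreign table and a plain heap table holding the same rows would show where the time is spent (the table names below are only illustrative):

explain analyze select zip, count(*) from customers_cstore group by 1;
explain analyze select zip, count(*) from customers_heap group by 1;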

from cstore_fdw.

mtuncer commented on August 16, 2024

Regarding the question about inserts:

cstore_fdw uses multiple levels of grouping when inserting rows.

The top-level group is the stripe. A stripe contains blocks, and each block in turn holds the data blocks for the individual columns.

With the default settings, a stripe has 15 blocks of 10,000 rows each, so a stripe contains 150K rows.
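These sizes correspond to the stripe_row_count and block_row_count table options. As a rough sketch (the column list is made up; the option values shown are the defaults):

create extension cstore_fdw;
create server cstore_server foreign data wrapper cstore_fdw;
create foreign table customers_cstore (zip text, state text, phone1 text)
    server cstore_server
    options (compression 'pglz', stripe_row_count '150000', block_row_count '10000');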

cstore reads one stripe at a time when loading from disk.

cstore also stores min/max values for each column at the block and stripe level. Therefore, it is possible to skip stripes and disk blocks that do not contribute to the query result when a filter is applied. If there is no filter, the query becomes a sequential scan of the table.

cstore does not do any sorting during insertion. Pre-sorting the input with respect to the most-queried field would increase query performance, since it reduces disk reads and the number of rows returned to PostgreSQL.
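As a sketch of why this helps (table and values are only illustrative): if the input is pre-sorted on zip, each stripe covers a narrow zip range, so a query like the one below can skip most stripes and blocks using the stored min/max values.

select state, count(*) from customers_cstore
where zip between '10000' and '19999'
group by state;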

from cstore_fdw.

k1th commented on August 16, 2024

@mtuncer
Regarding moving data from cstore_fdw to PG:
We query quite a big table with 30 million records, both in cstore and in PG, using these two queries:

select customerno, count(objectid) from object_columnar
where
donets >= '2015-07-01' and donets < '2015-08-01'
group by customerno

select customerno, count(objectid) from object
where
donets >= '2015-07-01' and donets < '2015-08-01'
group by customerno


select a, b, c, donets, d, e, f, g from object_columnar
where customerno = 'abdcdef' and donets >= '2015-07-01' and donets < '2015-08-01'
order by donets

select a, b, c, donets, d, e, f, g  from object
where customerno = 'abdcdef' and donets >= '2015-07-01' and donets < '2015-08-01'
order by donets

Using a regular select, the postgres table (which has an index on donets) is faster by a factor of 20 until the first results arrive (limit 20000).
Using the group by, the cstore query is faster by a factor of 50 or more.

I cannot explain this behaviour: the group by happens within PG, and the amount of data to be transferred from cstore to PG should be significantly bigger for the group by query, so that query should be slower.

Why could this be the case?

Thanks for your help

from cstore_fdw.

mtuncer commented on August 16, 2024

One big difference in your group by queries is that you are fetching only 3 columns from the table. This is the type of query that cstore will be better at (accessing a small number of columns in the table).

The other query (order by) gets 9 fields. That may have a significant effect on total query time.
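As a rough illustration (only a sketch, and it depends on which columns you actually need): narrowing the projection lets cstore skip the other columns' data blocks entirely, which should close some of the gap to the indexed heap table.

select donets, a from object_columnar
where customerno = 'abdcdef' and donets >= '2015-07-01' and donets < '2015-08-01'
order by donets;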

from cstore_fdw.
