
Comments (5)

vnnw commented on August 16, 2024

How did you insert the data into the cstore tables? If the data were inserted sorted on the columns you later use in WHERE/GROUP BY clauses, I think cstore will shine.
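For example, a pre-sorted load could look roughly like the following (the table and column names are only illustrative, and depending on the cstore_fdw version you may need to load through COPY rather than INSERT):

-- export the rows sorted on the column most often used in WHERE/GROUP BY,
-- then load the sorted file into the cstore foreign table
\copy (select * from customers_heap order by zip) to '/tmp/customers_sorted.csv' with csv
copy customers_cstore from '/tmp/customers_sorted.csv' with csv;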

from cstore_fdw.

mtuncer commented on August 16, 2024

cstore operates as a foreign data wrapper for PostgreSQL. Data is first fetched from cstore, then any grouping and counting is performed.

In the first query (SELECT zip, COUNT(*) FROM _cstore GROUP BY 1), all values belonging to the zip column are fetched from cstore, and grouping/counting is performed by PostgreSQL afterwards.

The 2nd query is similar: again all state and phone1 values are fetched, then grouping and distinct counting are performed by PostgreSQL.

The 3rd query is different in the sense that it has filtering; cstore runs through the field values and prunes out data blocks that do not have a match.

So the general reasons for these queries taking long are (1) having to return all values of the selected columns, and (2) group by/distinct processing over 16M rows.

I would like to investigate this issue further. Did you have a chance to run those queries on regular PostgreSQL tables for comparison purposes?
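For example, running the same aggregate through EXPLAIN ANALYZE against both the cstore foreign table and a plain heap table holding the same rows would show where the time is spent (the table names below are only illustrative):

explain analyze select zip, count(*) from customers_cstore group by 1;
explain analyze select zip, count(*) from customers_heap group by 1;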

from cstore_fdw.

mtuncer commented on August 16, 2024

Regarding the question about inserts:

cstore_fdw uses multiple levels of grouping when inserting rows.

The top-level group is the stripe. A stripe contains blocks, and each block in turn holds the data blocks for the individual columns.

With the default settings, a stripe has 15 blocks of 10,000 rows each, so a stripe contains 150K rows.
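These sizes correspond to the stripe_row_count and block_row_count table options. As a rough sketch (the column list is made up; the option values shown are the defaults):

create extension cstore_fdw;
create server cstore_server foreign data wrapper cstore_fdw;
create foreign table customers_cstore (zip text, state text, phone1 text)
    server cstore_server
    options (compression 'pglz', stripe_row_count '150000', block_row_count '10000');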

cstore reads one stripe at a time when loading from disk.

cstore also stores min/max values for each column at the block and stripe level. Therefore, it is possible to skip stripes and disk blocks that do not contribute to the query result when a filter is applied. If there is no filter, the query becomes a sequential scan of the table.

cstore does not do any sorting during insertion. Pre-sorting the input with respect to the most-queried field would increase query performance, since it reduces disk reads and the number of rows returned to PostgreSQL.
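As a sketch of why this helps (table and values are only illustrative): if the input is pre-sorted on zip, each stripe covers a narrow zip range, so a query like the one below can skip most stripes and blocks using the stored min/max values.

select state, count(*) from customers_cstore
where zip between '10000' and '19999'
group by state;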

from cstore_fdw.

k1th commented on August 16, 2024

@mtuncer
Regarding moving data from cstore_fdw to PG:
We query quite a big table with 30 million records, both in cstore and in PG, using these two queries:

select customerno, count(objectid) from object_columnar
where
donets >= '2015-07-01' and donets < '2015-08-01'
group by customerno

select customerno, count(objectid) from object
where
donets >= '2015-07-01' and donets < '2015-08-01'
group by customerno


select a, b, c, donets, d, e, f, g from object_columnar
where customerno = 'abdcdef' and donets >= '2015-07-01' and donets < '2015-08-01'
order by donets

select a, b, c, donets, d, e, f, g  from object
where customerno = 'abdcdef' and donets >= '2015-07-01' and donets < '2015-08-01'
order by donets

Using a regular select, the postgres table (which has an index on donets) is faster by a factor of 20 until the first results arrive (limit 20000).
Using the group by, the cstore query is faster by a factor of 50 or more.

I cannot explain this behaviour: the group by happens within PG, and the amount of data to be transferred from cstore to PG should be significantly bigger for the group by query, so that query should be slower.

Why could this be the case?

Thanks for your help

from cstore_fdw.

mtuncer commented on August 16, 2024

One big difference in your group by queries is that you are fetching only 3 columns from the table. This is the type of query that cstore will be better at (accessing a small number of columns in the table).

The other query (order by) gets 9 fields. That may have a significant effect on total query time.
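As a rough illustration (only a sketch, and it depends on which columns you actually need): narrowing the projection lets cstore skip the other columns' data blocks entirely, which should close some of the gap to the indexed heap table.

select donets, a from object_columnar
where customerno = 'abdcdef' and donets >= '2015-07-01' and donets < '2015-08-01'
order by donets;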

from cstore_fdw.
