The following query fails with index out of bound error:

ParquetReader passes wrong columnIds array to DuckDB::ParquetReader, about facebookincubator/velox

Comments (17)

yingsu00 commented on July 29, 2024 1

> Do you have a sense of when you might have something to share?

@mbasmanova Should be by next Thursday.

from velox.

hannes commented on July 29, 2024 1

Sure, drop me a line [email protected]

mbasmanova commented on July 29, 2024 1

@yingsu00 Here is a nice blog post about Parquet reader in DuckDB: https://duckdb.org/2021/06/25/querying-parquet.html

yingsu00 commented on July 29, 2024

cc @mbasmanova @majetideepak @aditi-pandit @frankobe
Before this is fixed, almost all queries on Parquet would return wrong results or just fail.

mbasmanova commented on July 29, 2024

@yingsu00 Thank you for filing an issue in the DuckDB project. I saw a reply from Hannes where he acknowledged the problem and stated that he'd be happy to review the fix. Would you like to work on that?

yingsu00 commented on July 29, 2024

@mbasmanova DuckDB parquet reader has much more problems than this. We need to come up with a plan together. I'm preparing a document/issue, stay tuned.

mbasmanova commented on July 29, 2024

> I'm preparing a document/issue, stay tuned.

@yingsu00 Do you have a sense of when you might have something to share?

mbasmanova commented on July 29, 2024

CC: @Mytherin @hannesmuehleisen

hannes commented on July 29, 2024

I’d be curious to hear what those “much more” problems are

yingsu00 commented on July 29, 2024

> I’d be curious to hear what those “much more” problems are

@hannesmuehleisen Maybe I should rephrase the word "problem" to "unsupported features" and "future improvements". For example, Velox needs to control and account for the memory the Parquet reader uses. As another example, only the RLE_DICTIONARY, PLAIN_DICTIONARY and PLAIN encodings are supported, and we need to expand the coverage. Another example is to support row group elimination based on row group stats. Currently the Parquet reader is nearly 2x slower than the Velox DWRF reader, even though the latter has to use expensive Varint encoding while Parquet doesn't. I'll describe more items in a GitHub issue.

Also, the engineering cost for us to help improve DuckDB Parquet reader is quite high currently: we had to copy the DuckDB code, amalgamate them into huge files that slow down and freeze the IDE quite often, and it's very hard for us to test a DuckDB PR with Velox without porting and formatting the code again. But we want to help and I'll be very happy to discuss options with you folks.
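
For readers unfamiliar with the Varint cost mentioned above, here is a minimal sketch of the common base-128 variable-length integer scheme. It assumes the usual layout (7 payload bits per byte, high bit as continuation flag); the function names are hypothetical and this is not the DWRF or Velox implementation. The decode loop handles one byte at a time with a data-dependent branch, which is one reason this kind of encoding tends to be slower to read than fixed-width or bit-packed data.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (illustrative helper)."""
    if value < 0:
        raise ValueError("this sketch only handles unsigned values")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):        # continuation bit clear: value is complete
            return result, pos
        shift += 7
```

For example, `decode_varint(encode_varint(300))` round-trips the value and reports that two bytes were consumed.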

hannes commented on July 29, 2024

@yingsu00 Some comments (CC @Mytherin):

- As far as I know, we had added memory allocation control to the Parquet reader for you guys. Is that not sufficient?
- Parquet has several specified but hardly used encodings. Which ones do you need in particular?
- Row group elimination based on stats is supported. Maybe it's an integration issue from Velox?
- Re engineering cost: what did you need to change besides the amalgamation? I agree editing that directly is not nice.

Happy to discuss these topics in more detail, too

yingsu00 commented on July 29, 2024

cc @dborkar

yingsu00 commented on July 29, 2024

@hannesmuehleisen I'm sorry, I may have missed that the memory allocator for Velox was already implemented, but there are other features that might help, e.g. caching and read-ahead, aggregation pushdown and more. We may also want to optimize for the hardware our services are running on. The encodings we need include at least DELTA_BINARY_PACKED. But anyway, the basic need is a way to work on the code more easily. Maybe we can discuss offline?
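
As background on the DELTA_BINARY_PACKED encoding requested above, here is a simplified sketch of its core idea: store the first value, then store each consecutive difference relative to the smallest difference, so the remaining numbers are small and non-negative and fit in few bits. This is an illustration under stated assumptions, not the Parquet wire format: the real encoding also splits values into blocks and miniblocks, writes varint headers, and bit-packs the adjusted deltas. The function names are hypothetical.

```python
from typing import List, Tuple


def delta_encode(values: List[int]) -> Tuple[int, int, List[int], int]:
    """Return (first_value, min_delta, adjusted_deltas, bit_width).

    adjusted_deltas[i] = (values[i + 1] - values[i]) - min_delta, always >= 0.
    bit_width is the number of bits needed for the largest adjusted delta.
    """
    if not values:
        raise ValueError("need at least one value")
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return values[0], 0, [], 0
    min_delta = min(deltas)
    adjusted = [d - min_delta for d in deltas]
    bit_width = max(adjusted).bit_length()
    return values[0], min_delta, adjusted, bit_width


def delta_decode(first_value: int, min_delta: int, adjusted: List[int]) -> List[int]:
    """Rebuild the original values from the output of delta_encode."""
    out = [first_value]
    for d in adjusted:
        out.append(out[-1] + d + min_delta)
    return out
```

With slowly increasing input such as `[100, 103, 106, 110]`, the adjusted deltas are `[0, 0, 1]`, so each needs one bit instead of a full integer.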

majetideepak commented on July 29, 2024

@hannesmuehleisen thanks for all the feedback. We will work with Ying and others to create a document with the missing pieces. I guess we can discuss further from there.

mbasmanova commented on July 29, 2024

> Row group elimination based on stats is supported. Maybe it's an integration issue from Velox?

Indeed, I see that being the case. There might be a bug in that logic though. I'm reading Parquet files for the 'lineitem' table created by Presto, and these appear to have incomplete stats, e.g. min is present, but max is not. This caused a failure when pushing down the `quantity < 24` filter:

INTERNAL Error: Invalid PhysicalType for GetTypeIdSize

Here is a repro: #881
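
To illustrate the incomplete-stats case described above, here is a minimal sketch of row group elimination that treats each statistic as optional. It is not the DuckDB or Velox logic; `ColumnStats` and both functions are hypothetical names. The point is that a row group may be skipped only when the available statistic proves no row can match, and a missing min or max must fall back to reading the row group rather than raising an error.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Per-row-group column statistics; either bound may be absent (hypothetical type)."""
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def can_skip_less_than(stats: ColumnStats, upper: float) -> bool:
    """Filter `col < upper`: skip only if every value is known to be >= upper."""
    if stats.min_value is None:
        return False  # no lower bound recorded: must read the row group
    return stats.min_value >= upper


def can_skip_greater_than(stats: ColumnStats, lower: float) -> bool:
    """Filter `col > lower`: skip only if every value is known to be <= lower."""
    if stats.max_value is None:
        return False  # no upper bound recorded: must read the row group
    return stats.max_value <= lower
```

In this sketch, a filter like `quantity < 24` only needs the min, so a row group with `ColumnStats(min_value=1.0)` and no max is simply read, while `ColumnStats(min_value=30.0)` can be skipped.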

yingsu00 commented on July 29, 2024

> Sure, drop me a line [email protected]

@hannesmuehleisen Thank you for your email. I actually already pinged you on LinkedIn. We'll send you something next week.

yingsu00 commented on July 29, 2024

@mbasmanova @hannesmuehleisen Per our internal discussions, we will contact DuckDB offline. I will hold off on publishing the document as a GitHub issue for now.
