
Comments (17)

yingsu00 commented on July 29, 2024

Do you have a sense of when you might have something to share?
@mbasmanova Should be by next Thursday.

from velox.

hannes commented on July 29, 2024

Sure, drop me a line [email protected]

mbasmanova commented on July 29, 2024

@yingsu00 Here is a nice blog post about Parquet reader in DuckDB: https://duckdb.org/2021/06/25/querying-parquet.html

yingsu00 commented on July 29, 2024

cc @mbasmanova @majetideepak @aditi-pandit @frankobe
Before this is fixed, almost all queries on Parquet would return wrong results or just fail.

mbasmanova commented on July 29, 2024

@yingsu00 Thank you for filing an issue in the DuckDB project. I saw a reply from Hannes where he acknowledged the problem and stated that he'd be happy to review the fix. Would you like to work on that?

yingsu00 commented on July 29, 2024

@mbasmanova DuckDB parquet reader has much more problems than this. We need to come up with a plan together. I'm preparing a document/issue, stay tuned.

mbasmanova commented on July 29, 2024

I'm preparing a document/issue, stay tuned.

@yingsu00 Do you have a sense of when you might have something to share?

mbasmanova commented on July 29, 2024

CC: @Mytherin @hannesmuehleisen

hannes commented on July 29, 2024

I’d be curious to hear what those “much more” problems are

yingsu00 commented on July 29, 2024

I’d be curious to hear what those “much more” problems are

@hannesmuehleisen Maybe I should rephrase the word "problem" to "unsupported features" and "future improvements". For example, Velox needs to control and account for the memory the Parquet reader uses. As another example, only the RLE_DICTIONARY, PLAIN_DICTIONARY and PLAIN encodings are supported, and we need to expand the coverage. Another example is supporting row group elimination based on row group stats. Currently the Parquet reader is nearly 2x slower than the Velox DWRF reader, even though the latter has to use expensive Varint encoding while Parquet doesn't. I'll describe more items in a GitHub issue.
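
For readers unfamiliar with why Varint decoding is considered expensive, here is a minimal sketch of a base-128 varint codec. It is an illustration only, not code from Velox, DWRF or DuckDB, and the function names are hypothetical. Each byte carries 7 payload bits plus a continuation bit, so decoding needs a data-dependent check per byte, whereas PLAIN-encoded Parquet integers are fixed width.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)  # high bit clear: last byte of this value
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # per-byte branch: the length is not known up front
            return result, pos
        shift += 7
```

The value 300, for instance, takes two bytes in this scheme, and the decoder only learns that after inspecting the continuation bit of the first byte.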

Also, the engineering cost for us to help improve DuckDB Parquet reader is quite high currently: we had to copy the DuckDB code, amalgamate them into huge files that slow down and freeze the IDE quite often, and it's very hard for us to test a DuckDB PR with Velox without porting and formatting the code again. But we want to help and I'll be very happy to discuss options with you folks.

hannes commented on July 29, 2024

@yingsu00 Some comments (CC @Mytherin):

  • As far as I know, we had added memory allocation control to the Parquet reader for you guys. Is that not sufficient?
  • Parquet has several specified but hardly used encodings. Which ones do you need in particular?
  • Row group elimination based on stats is supported, maybe it's an integration issue from Velox?
  • Re engineering cost, what did you need to change besides the amalgamation? I agree editing that directly is not nice.

Happy to discuss these topics in more detail, too

yingsu00 commented on July 29, 2024

cc @dborkar

yingsu00 commented on July 29, 2024

@hannesmuehleisen I'm sorry, I may have missed that the memory allocator for Velox was already implemented, but there are other features that might help, e.g. caching and read-ahead, aggregation pushdown and more. We may also want to optimize for the hardware our services are running on. The encodings we need include at least DELTA_BINARY_PACKED. But anyway, the basic need is a way to work on the code more easily. Maybe we can discuss offline?
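
As a rough illustration of the idea behind DELTA_BINARY_PACKED, the sketch below stores the first value, the minimum delta, and the deltas relative to that minimum, which then fit in a small bit width. This is a simplification with hypothetical function names: the actual Parquet encoding additionally splits values into blocks and miniblocks with per-miniblock bit widths and writes varint headers, none of which is modeled here.

```python
from typing import List, Tuple


def delta_encode(values: List[int]) -> Tuple[int, int, int, List[int]]:
    """Return (first_value, min_delta, bit_width, adjusted_deltas)."""
    if not values:
        raise ValueError("need at least one value")
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return values[0], 0, 0, []
    min_delta = min(deltas)
    # Subtracting the minimum makes every stored delta non-negative.
    adjusted = [d - min_delta for d in deltas]
    # Number of bits needed to bit-pack the largest adjusted delta.
    bit_width = max(adjusted).bit_length()
    return values[0], min_delta, bit_width, adjusted


def delta_decode(first_value: int, min_delta: int, adjusted: List[int]) -> List[int]:
    """Rebuild the original values from the first value and adjusted deltas."""
    out = [first_value]
    for d in adjusted:
        out.append(out[-1] + d + min_delta)
    return out
```

For a slowly increasing column such as `[100, 103, 105, 110, 112]`, the adjusted deltas need only 2 bits each in this sketch, which is why the encoding suits sorted or timestamp-like integer columns.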

majetideepak commented on July 29, 2024

@hannesmuehleisen thanks for all the feedback. We will work with Ying and others to create a document with the missing pieces. I guess we can discuss further from there.

mbasmanova commented on July 29, 2024

Row group elimination based on stats is supported, maybe it's an integration issue from Velox?

Indeed, I see that being the case. There might be a bug in that logic though. I'm reading Parquet files for the 'lineitem' table created by Presto, and these appear to have incomplete stats, e.g. min is present, but max is not. This generated a failure when pushing down the quantity < 24 filter.

INTERNAL Error: Invalid PhysicalType for GetTypeIdSize

Here is a repro: #881
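
To make the incomplete-stats case concrete, here is a minimal sketch of conservative row group elimination. It is not DuckDB's or Velox's actual logic; `ColumnStats` and both helper functions are hypothetical names, and null handling is ignored. The point it illustrates is that a missing min or max should mean "cannot prove anything, so read the row group" rather than an error.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Per-row-group statistics; either bound may be absent from the file."""
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def can_skip_less_than(stats: ColumnStats, bound: float) -> bool:
    """True only if the stats prove that no row satisfies `column < bound`."""
    if stats.min_value is None:
        return False  # no min available: must read the row group
    return stats.min_value >= bound


def can_skip_greater_than(stats: ColumnStats, bound: float) -> bool:
    """True only if the stats prove that no row satisfies `column > bound`."""
    if stats.max_value is None:
        return False  # no max available: must read the row group
    return stats.max_value <= bound
```

Under these assumptions, a filter like quantity < 24 needs only the min: `can_skip_less_than(ColumnStats(min_value=30.0), 24)` allows the row group to be skipped, while a row group with min 1.0 and no max is simply read.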

yingsu00 commented on July 29, 2024

Sure, drop me a line [email protected]

@hannesmuehleisen Thank you for your email. I actually already pinged you on LinkedIn. We'll send you something next week.

yingsu00 commented on July 29, 2024

@mbasmanova @hannesmuehleisen Per our internal discussions, we will contact DuckDB offline. I will hold off on publishing the document as a GitHub issue for now.
