Comments (17)
Do you have a sense of when you might have something to share?
@mbasmanova Should be by next Thursday.
from velox.
Sure, drop me a line [email protected]
@yingsu00 Here is a nice blog post about Parquet reader in DuckDB: https://duckdb.org/2021/06/25/querying-parquet.html
cc @mbasmanova @majetideepak @aditi-pandit @frankobe
Before this is fixed, almost all queries on Parquet would return wrong results or just fail.
@yingsu00 Thank you for filing an issue in the DuckDB project. I saw a reply from Hannes where he acknowledged the problem and stated that he'd be happy to review the fix. Would you like to work on that?
@mbasmanova DuckDB parquet reader has much more problems than this. We need to come up with a plan together. I'm preparing a document/issue, stay tuned.
I'm preparing a document/issue, stay tuned.
@yingsu00 Do you have a sense of when you might have something to share?
CC: @Mytherin @hannesmuehleisen
I’d be curious to hear what those “much more” problems are
I’d be curious to hear what those “much more” problems are
@hannesmuehleisen Maybe I should rephrase "problems" as "unsupported features" and "future improvements". For example, Velox needs to control and account for the memory the Parquet reader uses. As another example, only the RLE_DICTIONARY, PLAIN_DICTIONARY and PLAIN encodings are supported, and we need to expand the coverage. Another example is supporting row group elimination based on row group stats. Currently the Parquet reader is nearly 2x slower than the Velox DWRF reader, even though the latter has to use expensive Varint encoding while Parquet doesn't. I'll describe more items in a GitHub issue.
Also, the engineering cost for us to help improve the DuckDB Parquet reader is currently quite high: we had to copy the DuckDB code and amalgamate it into huge files that often slow down or freeze the IDE, and it's very hard for us to test a DuckDB PR with Velox without porting and formatting the code again. But we want to help, and I'll be very happy to discuss options with you folks.
@yingsu00 Some comments (CC @Mytherin):
- As far as I know, we had added memory allocation control to the Parquet reader for you guys. Is that not sufficient?
- Parquet has several specified but hardly used encodings. Which ones do you need in particular?
- Row group elimination based on stats is supported. Maybe it's an integration issue on the Velox side?
- Re engineering cost: what did you need to change besides the amalgamation? I agree that editing that directly is not nice.
Happy to discuss these topics in more detail, too.
cc @dborkar
@hannesmuehleisen I'm sorry, I may have missed that the memory allocator for Velox was already implemented, but there are other features that might help, e.g. caching and read-ahead, aggregation pushdown, and more. We may also want to optimize for the hardware our services are running on. The encodings we need include at least DELTA_BINARY_PACKED. But anyway, the basic need is a way to work on the code more easily. Maybe we can discuss offline?
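For readers unfamiliar with DELTA_BINARY_PACKED, the core idea can be sketched roughly as below. This is a simplified illustration and not the Parquet wire format: the real encoding splits values into blocks and miniblocks and bit-packs the adjusted deltas, all of which is omitted here. The function names are hypothetical and come from neither DuckDB nor Velox.

```python
from typing import List, Tuple


def delta_encode(values: List[int]) -> Tuple[int, int, List[int]]:
    """Return (first_value, min_delta, adjusted_deltas) for a non-empty list.

    Subtracting the minimum delta makes every stored delta non-negative,
    which is what lets a real encoder bit-pack them in few bits.
    """
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas, default=0)
    return values[0], min_delta, [d - min_delta for d in deltas]


def delta_decode(first: int, min_delta: int, adjusted: List[int]) -> List[int]:
    """Rebuild the original values by accumulating the restored deltas."""
    out = [first]
    for d in adjusted:
        out.append(out[-1] + d + min_delta)
    return out


values = [100, 103, 105, 110, 112]
encoded = delta_encode(values)
decoded = delta_decode(*encoded)
```

Slowly changing integer columns such as timestamps or sequence numbers produce small adjusted deltas, which is where this family of encodings tends to pay off.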
@hannesmuehleisen thanks for all the feedback. We will work with Ying and others to create a document with the missing pieces. I guess we can discuss further from there.
Row group elimination based on stats is supported, maybe it's an integration issue from Velox?
Indeed, I see that being the case. There might be a bug in that logic though. I'm reading Parquet files for the 'lineitem' table created by Presto, and these appear to have incomplete stats, e.g. min is present but max is not. This generated a failure when pushing down the quantity < 24 filter:
INTERNAL Error: Invalid PhysicalType for GetTypeIdSize
Here is a repro: #881
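To illustrate the kind of handling incomplete stats call for, here is a minimal sketch of stats-based row group elimination that treats a missing bound as "unknown" and keeps the row group. It is not DuckDB or Velox code, and it makes no claim about the cause of the error above; the ColumnStats class and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Per-row-group min/max for one column; either bound may be missing."""
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def may_contain_less_than(stats: ColumnStats, bound: float) -> bool:
    """Return False only when the stats prove no row satisfies value < bound."""
    if stats.min_value is None:
        # Without a minimum nothing can be proven, so keep the row group.
        return True
    return stats.min_value < bound


def may_contain_greater_than(stats: ColumnStats, bound: float) -> bool:
    """Return False only when the stats prove no row satisfies value > bound."""
    if stats.max_value is None:
        # Without a maximum nothing can be proven, so keep the row group.
        return True
    return stats.max_value > bound


row_groups = [
    ColumnStats(min_value=1.0, max_value=None),   # max missing, as in the report
    ColumnStats(min_value=30.0, max_value=50.0),  # provably no value < 24
    ColumnStats(),                                # no stats at all
]

# Indices of row groups that still have to be read for quantity < 24.
kept = [i for i, s in enumerate(row_groups) if may_contain_less_than(s, 24.0)]
```

The point of the design is that a missing bound may only cost extra reads; it should never cause a row group to be skipped or the scan to fail.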
Sure, drop me a line [email protected]
@hannesmuehleisen Thank you for your email. I actually already pinged you on LinkedIn. We'll send you something next week.
@mbasmanova @hannesmuehleisen Per our internal discussions, we will contact DuckDB offline. I will hold off on publishing the document as a GitHub issue for now.