Comments (6)
@zhli1142015 Thank you for reporting and investigating this issue. I have some questions.
What is the performance of vanilla Spark on this query?
Within this join, there is a significant number of duplicate rows on the build side.
How many build-side rows are there per unique key? Is there a skew where some of the keys have a lot of build-side rows while others a few?
I see that number of join output rows is ~ 2x of the number of probe rows. Is it the case that only small subset of probe rows match, but they match with many build side rows? I don't have easy access to TPC-DS dataset, hence, cannot easily check these cardinalities myself.
Thank you also for proposing a solution. I see that the solution applies only to kArrayMode, but isn't this problem more generic and may happen in any hash mode? If that's the case, let's make sure the solution is generic as well.
To help evaluated proposed solution and, perhaps, iterate on other options, it would be helpful to create a benchmark that reproduces the issue. Would you be willing to help with that?
std::unordered_map<char*, std::shared_ptr<std::vector<char*>>> duplicateRows_;
I see you are using std::unordered_map. It might be more efficient to use folly::F14FastMap.
I also noticed that memory for duplicateRows_ is allocated via malloc directly and therefore is not accounted for in Velox memory pools. Let's make sure we allocate memory from a pool. See StlAllocator and AlignedStlAllocator in velox/common/memory/HashStringAllocator.h
I also noticed that you use std::shared_ptr over the vector. What's the motivation for doing that? Why no use std::vector directly?
Once we have a benchmark, it would be nice to check whether this optimization always works or if there is a regression in some cases, i.e. when the number of duplicates is low (2).
Let me know how you'd like to proceed.
CC: @Yuhta @xiaoxmeng
from velox.
We also need to update the address list when we erase rows. I would suggest we put the list inside row container and avoid a second probe for duplicates.
from velox.
Hello @mbasmanova , could you help check this? Thanks.
from velox.
Thanks @mbasmanova for your suggestions.
What is the performance of vanilla Spark on this query?
In our test, when using Spark, the latency is 39 seconds, and when using Velox, the latency is 60 seconds. After applying this fix, the latency is reduced to 40 seconds.
Is it the case that only small subset of probe rows match, but they match with many build side rows?
Yes, this is the scenario I observed. I collected the sizes of all duplicate row vextors by logs. The average size is over sixty.
I see that the solution applies only to kArrayMode, but isn't this problem more generic and may happen in any hash mode?
I feel this problem is a common issue for all hash modes too. I fix it only for array mode, as this is the only pattern we obsrved in TPCDS. I can try different modes with benchmark to see if this is a common issue.
Do you think if it's ok to address your comments and include a benchmark in the same pull request?
Some comments from you:
- use folly::F14FastMap to replace std::unordered_map.
- memory for duplicateRows_ should be allocated from a pool.
- Use std::vector instead of shared pointer.
The scenarios to verify would include: 1) when the number of duplicates is low (e.g., 2), and 2) when there are a high number of duplicates (e.g., 100 or more). Please let me know if there are any other cases I should cover in the benchmark.
Thanks.
from velox.
The average size is over sixty.
Got it. Would you clarify a bit further about the distribution? For example, can you tell what are p25, p50, p90, p95 and max or describe the distribution in some other way?
I can try different modes with benchmark to see if this is a common issue.
That would be great. Thanks.
Do you think if it's ok to address your comments and include a benchmark in the same pull request?
I suggest to work with a single PR for now. Once we have the full solution we can decide whether it needs to be split into multiple PRs. The first step is to figure out in which cases we have a problem and what's the best way to address all these cases without regressing in other cases.
- when the number of duplicates is low (e.g., 2), and 2) when there are a high number of duplicates (e.g., 100 or more).
It would be nice to write the benchmark in a way that allows us to easily test different distributions, e.g. 50% row have 2 dups, 35 have 10 dups, 10 have 50 dups, 5 have 100 dups (or something along these lines).
from velox.
Also would be nice if we have a benchmark for erasing performance. This list will be faster to traverse but slower to update.
from velox.
Related Issues (20)
- Dependencies of GCS have not been updated for nearly a year
- Nightly Jobs (Scheduled Jobs) are failing HOT 4
- UBSAN: overflow and hookLimits in SumAggregationTest failed HOT 1
- Suspected bug related to read with Mutation
- Add NO_PROXY CIDR support for s3 hive connector HOT 1
- System crashes while build the velox HOT 4
- Join spilling support in mixed execution mode
- Right side join is not supported for mixed execution mode
- Add decimal support for min(x,n), max(x,n) HOT 1
- Upgrade presto docker image to Presto 0.286
- approx_set Presto function in Velox supports more signatures than Presto HOT 3
- Add support for manylinux_2_28 from manylinux 2014
- Fix test method names in sparksql/tests/StringTest.cpp
- Fails to create vector function with hard-coded scale HOT 1
- Add Classification Metrics Aggregate Presto Functions
- Make test methods lambdas in sparksql/tests/StringTest.cpp
- Presto aggregate function regr_r2 fails in AggregationFuzzer HOT 2
- Build ERROR ON APPLE M1
- Add support for DECIMAL types to Simple Function API HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from velox.