Bug deion We observed one join operator in TPCDS query72 (CB

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hello <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-ur

Thanks <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

Long time spent in function 'listJoinResults' for array mode join probe about velox HOT 6 OPEN

zhli1142015 commented on July 18, 2024 1

Long time spent in function 'listJoinResults' for array mode join probe

from velox.

Comments (6)

mbasmanova commented on July 18, 2024 1

@zhli1142015 Thank you for reporting and investigating this issue. I have some questions.

What is the performance of vanilla Spark on this query?

Within this join, there is a significant number of duplicate rows on the build side.

How many build-side rows are there per unique key? Is there a skew where some of the keys have a lot of build-side rows while others a few?

I see that number of join output rows is ~ 2x of the number of probe rows. Is it the case that only small subset of probe rows match, but they match with many build side rows? I don't have easy access to TPC-DS dataset, hence, cannot easily check these cardinalities myself.

Thank you also for proposing a solution. I see that the solution applies only to kArrayMode, but isn't this problem more generic and may happen in any hash mode? If that's the case, let's make sure the solution is generic as well.

To help evaluated proposed solution and, perhaps, iterate on other options, it would be helpful to create a benchmark that reproduces the issue. Would you be willing to help with that?

std::unordered_map<char*, std::shared_ptr<std::vector<char*>>> duplicateRows_;

I see you are using std::unordered_map. It might be more efficient to use folly::F14FastMap.

I also noticed that memory for duplicateRows_ is allocated via malloc directly and therefore is not accounted for in Velox memory pools. Let's make sure we allocate memory from a pool. See StlAllocator and AlignedStlAllocator in velox/common/memory/HashStringAllocator.h

I also noticed that you use std::shared_ptr over the vector. What's the motivation for doing that? Why no use std::vector directly?

Once we have a benchmark, it would be nice to check whether this optimization always works or if there is a regression in some cases, i.e. when the number of duplicates is low (2).

Let me know how you'd like to proceed.

CC: @Yuhta @xiaoxmeng

from velox.

Yuhta commented on July 18, 2024 1

We also need to update the address list when we erase rows. I would suggest we put the list inside row container and avoid a second probe for duplicates.

from velox.

zhli1142015 commented on July 18, 2024

Hello @mbasmanova , could you help check this? Thanks.

from velox.

zhli1142015 commented on July 18, 2024

Thanks @mbasmanova for your suggestions.

What is the performance of vanilla Spark on this query?

In our test, when using Spark, the latency is 39 seconds, and when using Velox, the latency is 60 seconds. After applying this fix, the latency is reduced to 40 seconds.

Is it the case that only small subset of probe rows match, but they match with many build side rows?

Yes, this is the scenario I observed. I collected the sizes of all duplicate row vextors by logs. The average size is over sixty.

I see that the solution applies only to kArrayMode, but isn't this problem more generic and may happen in any hash mode?

I feel this problem is a common issue for all hash modes too. I fix it only for array mode, as this is the only pattern we obsrved in TPCDS. I can try different modes with benchmark to see if this is a common issue.

Do you think if it's ok to address your comments and include a benchmark in the same pull request?

Some comments from you:

use folly::F14FastMap to replace std::unordered_map.
memory for duplicateRows_ should be allocated from a pool.
Use std::vector instead of shared pointer.

The scenarios to verify would include: 1) when the number of duplicates is low (e.g., 2), and 2) when there are a high number of duplicates (e.g., 100 or more). Please let me know if there are any other cases I should cover in the benchmark.
Thanks.

from velox.

mbasmanova commented on July 18, 2024

The average size is over sixty.

Got it. Would you clarify a bit further about the distribution? For example, can you tell what are p25, p50, p90, p95 and max or describe the distribution in some other way?

I can try different modes with benchmark to see if this is a common issue.

That would be great. Thanks.

Do you think if it's ok to address your comments and include a benchmark in the same pull request?

I suggest to work with a single PR for now. Once we have the full solution we can decide whether it needs to be split into multiple PRs. The first step is to figure out in which cases we have a problem and what's the best way to address all these cases without regressing in other cases.

when the number of duplicates is low (e.g., 2), and 2) when there are a high number of duplicates (e.g., 100 or more).

It would be nice to write the benchmark in a way that allows us to easily test different distributions, e.g. 50% row have 2 dups, 35 have 10 dups, 10 have 50 dups, 5 have 100 dups (or something along these lines).

from velox.

Yuhta commented on July 18, 2024

Also would be nice if we have a benchmark for erasing performance. This list will be faster to traverse but slower to update.

from velox.

Long time spent in function 'listJoinResults' for array mode join probe about velox HOT 6 OPEN

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent