
Comments (6)

mbasmanova commented on July 18, 2024

@zhli1142015 Thank you for reporting and investigating this issue. I have some questions.

What is the performance of vanilla Spark on this query?

Within this join, there is a significant number of duplicate rows on the build side.

How many build-side rows are there per unique key? Is there skew, where some keys have many build-side rows while others have only a few?

I see that the number of join output rows is ~2x the number of probe rows. Is it the case that only a small subset of probe rows match, but each matches many build-side rows? I don't have easy access to the TPC-DS dataset, hence cannot easily check these cardinalities myself.

Thank you also for proposing a solution. I see that the solution applies only to kArrayMode, but isn't this problem more generic, and may it happen in any hash mode? If that's the case, let's make sure the solution is generic as well.

To help evaluate the proposed solution and, perhaps, iterate on other options, it would be helpful to create a benchmark that reproduces the issue. Would you be willing to help with that?

std::unordered_map<char*, std::shared_ptr<std::vector<char*>>> duplicateRows_;

I see you are using std::unordered_map. It might be more efficient to use folly::F14FastMap.

I also noticed that memory for duplicateRows_ is allocated via malloc directly and therefore is not accounted for in Velox memory pools. Let's make sure we allocate memory from a pool. See StlAllocator and AlignedStlAllocator in velox/common/memory/HashStringAllocator.h

I also noticed that you use std::shared_ptr over the vector. What's the motivation for doing that? Why not use std::vector directly?
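
For concreteness, here is a minimal sketch of what all three changes could look like together. It assumes StlAllocator in velox/common/memory/HashStringAllocator.h is a standard-conforming allocator and that folly::F14FastMap accepts the allocator as its fifth template parameter; the type aliases and the makeDuplicateRowMap helper are illustrative, not existing Velox code:

```cpp
#include <folly/container/F14Map.h>
#include <functional>
#include <utility>
#include <vector>
#include "velox/common/memory/HashStringAllocator.h"

using facebook::velox::HashStringAllocator;
using facebook::velox::StlAllocator;

// Plain std::vector value (no shared_ptr), backed by the pool-aware allocator.
using DuplicateRowList = std::vector<char*, StlAllocator<char*>>;

// F14 map whose storage is also drawn from the HashStringAllocator, so the
// memory shows up in Velox memory pools instead of raw malloc.
using DuplicateRowMap = folly::F14FastMap<
    char*,
    DuplicateRowList,
    std::hash<char*>,
    std::equal_to<char*>,
    StlAllocator<std::pair<char* const, DuplicateRowList>>>;

// The map and each list value need the same allocator at construction time,
// since StlAllocator is not default-constructible.
DuplicateRowMap makeDuplicateRowMap(HashStringAllocator* allocator) {
  return DuplicateRowMap{
      StlAllocator<std::pair<char* const, DuplicateRowList>>(allocator)};
}
```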

Once we have a benchmark, it would be nice to check whether this optimization always works or if there is a regression in some cases, e.g., when the number of duplicates is low (2).

Let me know how you'd like to proceed.

CC: @Yuhta @xiaoxmeng


Yuhta commented on July 18, 2024

We also need to update the address list when we erase rows. I would suggest we put the list inside the row container and avoid a second probe for duplicates.
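
For illustration, a rough sketch of that direction, assuming each row reserves a pointer-sized slot at some nextOffset for the link; the offset and helper functions are hypothetical, not the actual RowContainer API:

```cpp
#include <cstdint>

// Hypothetical helper: each row stores a pointer to the next duplicate at
// 'nextOffset', so duplicates form a singly linked list threaded through the
// row container itself and no side map is needed.
inline char*& nextDuplicate(char* row, int32_t nextOffset) {
  return *reinterpret_cast<char**>(row + nextOffset);
}

// Insert 'row' right after 'head' in the duplicate chain: O(1).
inline void appendDuplicate(char* head, char* row, int32_t nextOffset) {
  nextDuplicate(row, nextOffset) = nextDuplicate(head, nextOffset);
  nextDuplicate(head, nextOffset) = row;
}

// Unlink 'row' on erase: walks the chain from 'head'. This is the update cost
// that differs from a contiguous address list, which erases by find-and-shift.
inline void eraseDuplicate(char* head, char* row, int32_t nextOffset) {
  for (char* cur = head; cur != nullptr;
       cur = nextDuplicate(cur, nextOffset)) {
    if (nextDuplicate(cur, nextOffset) == row) {
      nextDuplicate(cur, nextOffset) = nextDuplicate(row, nextOffset);
      return;
    }
  }
}
```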


zhli1142015 commented on July 18, 2024

Hello @mbasmanova, could you help check this? Thanks.


zhli1142015 commented on July 18, 2024

Thanks @mbasmanova for your suggestions.

What is the performance of vanilla Spark on this query?

In our test, when using Spark, the latency is 39 seconds, and when using Velox, the latency is 60 seconds. After applying this fix, the latency is reduced to 40 seconds.

Is it the case that only a small subset of probe rows match, but each matches many build-side rows?

Yes, this is the scenario I observed. I collected the sizes of all duplicate-row vectors via logs. The average size is over sixty.

I see that the solution applies only to kArrayMode, but isn't this problem more generic, and may it happen in any hash mode?

I feel this problem is common to all hash modes too. I fixed it only for array mode, as this is the only pattern we observed in TPC-DS. I can try different modes with a benchmark to see if this is a common issue.

Do you think it's OK to address your comments and include a benchmark in the same pull request?

To summarize your comments:

  1. Use folly::F14FastMap to replace std::unordered_map.
  2. Allocate memory for duplicateRows_ from a pool.
  3. Use std::vector instead of a shared pointer.

The scenarios to verify would include: 1) when the number of duplicates is low (e.g., 2), and 2) when the number of duplicates is high (e.g., 100 or more). Please let me know if there are any other cases I should cover in the benchmark.
Thanks.


mbasmanova commented on July 18, 2024

The average size is over sixty.

Got it. Would you clarify the distribution a bit further? For example, can you tell what the p25, p50, p90, p95 and max are, or describe the distribution in some other way?

I can try different modes with benchmark to see if this is a common issue.

That would be great. Thanks.

Do you think it's OK to address your comments and include a benchmark in the same pull request?

I suggest working with a single PR for now. Once we have the full solution, we can decide whether it needs to be split into multiple PRs. The first step is to figure out in which cases we have a problem and what's the best way to address all those cases without regressing in others.

  1) when the number of duplicates is low (e.g., 2), and 2) when the number of duplicates is high (e.g., 100 or more).

It would be nice to write the benchmark in a way that allows us to easily test different distributions, e.g., 50% of rows have 2 dups, 35% have 10 dups, 10% have 50 dups, 5% have 100 dups (or something along these lines).
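
As a sketch, one way to parameterize the build side; the makeBuildKeys helper and its signature are made up for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Generate build-side join keys following a duplicate-count distribution,
// e.g. {{0.5, 2}, {0.35, 10}, {0.1, 50}, {0.05, 100}} means 50% of unique
// keys appear twice, 35% ten times, and so on. Fractions should sum to 1.
std::vector<int64_t> makeBuildKeys(
    int64_t numUniqueKeys,
    const std::vector<std::pair<double, int32_t>>& distribution) {
  std::vector<int64_t> keys;
  int64_t nextKey = 0;
  for (const auto& [fraction, numDuplicates] : distribution) {
    const auto numKeys = static_cast<int64_t>(fraction * numUniqueKeys);
    for (int64_t i = 0; i < numKeys; ++i, ++nextKey) {
      for (int32_t d = 0; d < numDuplicates; ++d) {
        keys.push_back(nextKey);
      }
    }
  }
  // Shuffle so duplicates are not adjacent, closer to real data.
  std::mt19937 rng(42);
  std::shuffle(keys.begin(), keys.end(), rng);
  return keys;
}
```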


Yuhta commented on July 18, 2024

It would also be nice to have a benchmark for erase performance. This list will be faster to traverse but slower to update.
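
A folly::Benchmark sketch of the erase side for the vector-based list; sizes are illustrative, and the list copy is kept outside the timed region:

```cpp
#include <folly/Benchmark.h>
#include <folly/init/Init.h>
#include <algorithm>
#include <vector>

// Erasing one row from a contiguous duplicate list is a linear find plus a
// shift of the trailing elements; this measures that cost in isolation.
BENCHMARK(eraseMiddleRowFromVector, iters) {
  std::vector<char*> rows(64); // roughly the average duplicate count above
  for (size_t i = 0; i < rows.size(); ++i) {
    rows[i] = reinterpret_cast<char*>(i + 1);
  }
  auto* target = reinterpret_cast<char*>(32); // a row in the middle
  for (unsigned i = 0; i < iters; ++i) {
    folly::BenchmarkSuspender suspender;
    auto copy = rows; // rebuild the list outside the timed region
    suspender.dismiss();
    copy.erase(std::find(copy.begin(), copy.end(), target));
  }
}

int main(int argc, char** argv) {
  folly::Init init(&argc, &argv);
  folly::runBenchmarks();
  return 0;
}
```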

