Problem statement cudf.pandas has substantially increased the numb

[Story] Enabling prefetching of unified memory about cudf HOT 5 OPEN

vyasr commented on August 25, 2024

[Story] Enabling prefetching of unified memory

from cudf.

Comments (5)

davidwendt commented on August 25, 2024

I'd like us to consider an alternate libcudf implementation that is more work but may be better in terms of control and maintenance going forward. I believe we could build a set of utilities that accept pointers or a variety of container types and perform the prefetch, and then insert calls to those utilities before each kernel launch. This gives the algorithm author the most control over when and what is prefetched, with no surprises or side effects.

I'd like to keep logic like this out of the containers (column_view and device_uvector). I feel these introduce hidden side effects that would be difficult to avoid, similar to the lazy-null-count logic that was removed several releases ago. I know this is more work, but I think having the logic inline with the kernel launches will be easier to maintain and control. We can easily decide which algorithms need prefetching (and when, how, and which parts) and iteratively work on specific chunking solutions in the future without affecting all the other APIs.
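As a rough illustration of the utility approach described above, the following host-only sketch shows one possible shape for such helpers: a raw pointer overload, a container overload, and an algorithm that calls them explicitly before its kernel launch. Every name here (`prefetch_record`, `prefetch_log`, `prefetch_async`, `scale_in_place`) is hypothetical and not taken from libcudf. The actual CUDA call is replaced by a stub that records the request, on the assumption that a real version would call something like `cudaMemPrefetchAsync` at that point.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record of one prefetch request; it exists only so the sketch
// can run (and be inspected) on a machine without a GPU.
struct prefetch_record {
  void const* ptr;
  std::size_t bytes;
};

// Log of requests issued so far (illustrative, not part of libcudf).
inline std::vector<prefetch_record>& prefetch_log()
{
  static std::vector<prefetch_record> log;
  return log;
}

// Lowest-level utility: raw pointer plus size in bytes.
// In real CUDA code this is where a call such as
//   cudaMemPrefetchAsync(ptr, bytes, device_id, stream);
// would go; the stub below only records the request.
inline void prefetch_async(void const* ptr, std::size_t bytes)
{
  if (ptr == nullptr || bytes == 0) { return; }  // nothing to move
  prefetch_log().push_back({ptr, bytes});
}

// Convenience overload for any contiguous container exposing data()/size().
template <typename Container>
void prefetch_async(Container const& c)
{
  using value_type = typename Container::value_type;
  prefetch_async(static_cast<void const*>(c.data()), c.size() * sizeof(value_type));
}

// Stand-in for an algorithm: the author decides explicitly which inputs to
// prefetch, immediately before the (omitted) kernel launch.
inline void scale_in_place(std::vector<double>& values, std::vector<int> const& scratch)
{
  prefetch_async(values);               // read and written by the "kernel"
  (void)scratch;                        // deliberately not prefetched
  for (auto& v : values) { v *= 2.0; }  // placeholder for the kernel launch
}
```

Because the prefetch calls sit next to the launch, it is visible at the call site which buffers are moved and which are not, which is the property the comment argues for.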

vyasr commented on August 25, 2024

I concur with your assessment long term, but as detailed in the issue I don't think it is feasible on the timeline we are seeking. Inserting changes before every kernel launch, even fairly trivial changes, seems like a task that will take at least one full release since the initial work will require achieving consensus on what those changes should be.

Is there something I wrote in the issue that you disagree with? I tried to address pretty much this exact concern in the issue since I share it and anticipated that others would raise it at this point.

davidwendt commented on August 25, 2024

> Is there something I wrote in the issue that you disagree with? I tried to address pretty much this exact concern in the issue since I share it and anticipated that others would raise it at this point.

I only disagree with modifying column_view and subclassing device_uvector even in the short term. The first makes me uneasy for the codebase because of its global nature. It likely will not hit all the desired code paths and may cause unnecessary prefetching in other cases (causing more workarounds, etc). The subclassed device_uvector requires a wide change to the codebase on a similar scale that I was proposing so it does not save us that much work.

I was hoping that we could add prefetching to a few APIs quickly using a targeted approach with a handful of utilities in the short term, and then roll out the rest in the long term.

vyasr commented on August 25, 2024

> I was hoping that we could add prefetching to a few APIs quickly using a targeted approach with a handful of utilities in the short term, and then roll out the rest in the long term.

The problem I see with that approach is that while we might be able to see good results on a particular set of benchmarks that way, we will not be able to enable a managed memory resource as default without substantially slowing down a wide range of APIs (anything that doesn't have prefetching enabled). We should at minimum test running the cudf microbenchmarks with a managed memory resource. I suspect that the results will not support using a managed memory resource by default in cudf.pandas without the more blanket approach for prefetching, unless we choose to wait for the longer term solution where we roll out your proposed changes to more APIs.

vyasr commented on August 25, 2024

Copying from Slack:

We came to the following compromise during the discussion:

We will merge the column_view/mutable_column_view changes from #16020 to allow prefetching to occur on the widest set of APIs possible in the short term.

We will not merge the device_uvector changes because they require touching many places. Instead, we will find every place where such changes would be needed and insert manual prefetch calls there, as in #16265. Since that is the long-term solution that we prefer anyway, and the same number of places would need changing, we should do that rather than change device_uvector. My hope is that in the short term these will all be prefetches on device_uvectors or device_buffers, the places where we know the column_view changes above have no effect.

We will include the prefetch allocator.

We will keep the configuration options in place.

Over the course of the next couple of months, we will run libcudf benchmarks and cudf Python microbenchmarks using managed memory and identify hot spots that need manual prefetching added. As we do this, we will turn off the column_view prefetching to ensure that the manual prefetches capture all of the same needs. Once we are satisfied, we will remove prefetching from column_view.

I'm going to work on updating #16020 today to remove the undesirable changes, then David and I will aim to get his changes merged in tomorrow.
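To make the role of the configuration options in this plan concrete, here is a minimal host-only sketch, assuming a registry of per-call-site switches. The class names (`prefetch_config`, `prefetcher`) and the key strings used with it are illustrative assumptions, not the options that #16020 actually defines, and the prefetch itself is replaced by a counter so the code runs without a GPU.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical registry of per-call-site prefetch switches.
class prefetch_config {
 public:
  bool enabled(std::string const& key) const
  {
    auto it = flags_.find(key);
    return it != flags_.end() && it->second;  // unknown keys default to off
  }
  void set(std::string const& key, bool value) { flags_[key] = value; }

 private:
  std::map<std::string, bool> flags_;
};

// Counts requests instead of calling cudaMemPrefetchAsync, so the sketch
// stays runnable on a machine without CUDA.
struct prefetcher {
  prefetch_config config;
  std::size_t issued = 0;

  void prefetch(std::string const& key, void const* ptr, std::size_t bytes)
  {
    if (!config.enabled(key) || ptr == nullptr || bytes == 0) { return; }
    ++issued;  // real code would issue the asynchronous prefetch here
  }
};
```

With a layout like this, the blanket column_view prefetch and each manual prefetch can sit behind separate keys. Turning off only the column_view key while benchmarking would then show whether the manual calls alone cover the hot spots, which matches the migration path described in the compromise.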
