I have an issue with DBSCAN terminating on large datasets. I'm running the latest NGC

cuml dbscan terminating on large datasets 'invalid configuration argument' about cuml HOT 12 CLOSED

rapidsai commented on May 19, 2024 1

cuml dbscan terminating on large datasets 'invalid configuration argument'

from cuml.

Comments (12)

dantegd commented on May 19, 2024 2

Thanks @MurrayData! There are 2 open PRs (#44 and #47) that make a lot of improvements to dbscan, so we're actively working on solving this soon.

cjnolet commented on May 19, 2024

@MurrayData, what version of CUDA are you running?

cjnolet commented on May 19, 2024

@teju85 I wonder if this is being caused by the way Cutlass is using TilingStrategy and calculating the resources for the kernels. I was able to get past the error by modifying dbscan.cu to use the tall tiling strategy (503) but now the kernel appears to run indefinitely.

Here's the script I'm using to reproduce this:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN as skDBSCAN
from cuml import DBSCAN as cumlDBSCAN
import cudf
import os

import gzip
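def load_data(nrows, ncols, cached = 'data/mortgage.npy.gz'):
    # Use the cached mortgage data if it exists, otherwise fall back to uniform random data.
    if os.path.exists(cached):
        print('use mortgage data')
        with gzip.open(cached) as f:
            X = np.load(f)
        # Sample nrows rows with replacement and keep the first ncols columns.
        # randint's upper bound is exclusive, so X.shape[0] covers every row.
        X = X[np.random.randint(0,X.shape[0],nrows),:ncols]
    else:
        print('use random data')
        X = np.random.rand(nrows,ncols)
    # One column per feature: fea0, fea1, ...
    df = pd.DataFrame({'fea%d'%i:X[:,i] for i in range(X.shape[1])})
    return df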

nrows = 10000000
ncols = 2

X = load_data(nrows,ncols)
print('data',X.shape)

eps = 3
min_samples = 2

X = cudf.DataFrame.from_pandas(X)

clustering_cuml = cumlDBSCAN(eps = eps, min_samples = min_samples)
clustering_cuml.fit(X)
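
For context on the error message itself: CUDA reports 'invalid configuration argument' when a kernel is launched with grid or block dimensions outside the device limits. The sketch below is only a back-of-envelope check, not what dbscan.cu or Cutlass actually does. The 128x128 tile size and the `grid_for_pairwise` helper are hypothetical, chosen to show how a 2D tiling over all row pairs could exceed the 65535 limit on the grid's y dimension at 10 million rows.

```python
# CUDA grid-size limits for compute capability 3.0 and later.
MAX_GRID_X = 2**31 - 1
MAX_GRID_Y = 65535
MAX_GRID_Z = 65535


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def grid_for_pairwise(n_rows, tile_rows=128, tile_cols=128):
    """Grid dimensions for a hypothetical 2D tiling of an n_rows x n_rows problem.

    The tile sizes are assumptions for illustration, not cuML's real values.
    """
    return (ceil_div(n_rows, tile_cols), ceil_div(n_rows, tile_rows), 1)


def launch_is_valid(grid):
    """True if every grid dimension is within the CUDA limits above."""
    gx, gy, gz = grid
    return (1 <= gx <= MAX_GRID_X
            and 1 <= gy <= MAX_GRID_Y
            and 1 <= gz <= MAX_GRID_Z)


for n in (100_000, 1_000_000, 10_000_000):
    grid = grid_for_pairwise(n)
    print(n, grid, launch_is_valid(grid))
```

Under these assumed tile sizes, the y dimension stays within bounds at 1 million rows and goes out of bounds at 10 million, which would be consistent with the failure only showing up on large inputs.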

MurrayData commented on May 19, 2024

@cjnolet I'm using 10.0 with the corresponding Docker container: rapidsai/rapidsai:cuda10.0_ubuntu16.04

Apologies for the delay in replying, I was away for most of the weekend.

teju85 commented on May 19, 2024

@cjnolet The tiling strategy only means that we are still using my fork of cutlass, right? I was under the impression that we were going to use the latest version of cutlass and deprecate my fork of cutlass used here. And this was the plan with PRs #44 and #47, right? Or did I get anything wrong here?

dantegd commented on May 19, 2024

@teju85 yes, that is the plan for the very near future (i.e. version 0.5), with #47 being merged pretty soon and #44 following not long after, though I believe there is still a scalability issue in it and Corey is working on plenty of stuff. @cjnolet feel free to correct me :)

@MurrayData we’ll keep you posted in case the refactored dbscan is merged to 0.5 before we squash the bug in the current version (though I’d bet you have already seen that there is a branch called branch-0.5).

dantegd commented on May 19, 2024

@cjnolet do you know the status of this bug in the refactored dbscan of 0.5?

cjnolet commented on May 19, 2024

@teju85, have you been able to find a culprit for this issue yet?

I think we should push this to 0.6 and focus on #80.

dantegd commented on May 19, 2024

@cjnolet based on the bugs themselves, I don’t see any reason why #80 would be higher priority, since this is a full-blown crash. We can revisit if there is no progress toward a solution by the end of the week due to lack of cycles, but I don’t think we should before then.

cjnolet commented on May 19, 2024

#80 is a problem because users are getting incorrect results: in some cases no clusters are found whatsoever when there should be several. This is happening at all scales.

This ticket is related to the algorithm's ability to scale to millions of training examples.

Ideally we would all have the cycles to knock both of these tickets out, however, if I have to choose between the two, I'd personally prioritize correctness over scale.
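
To put the scale side in numbers, here is a rough estimate. It assumes a dense n x n neighborhood matrix with one byte per entry, which this thread does not confirm is how the current dbscan stores adjacency. The helper names and the 16 GiB budget are made up for illustration. Row batching is shown as one common mitigation, not as what #44 or #47 implement.

```python
def dense_pairwise_bytes(n_rows, bytes_per_entry=1):
    """Bytes needed for a dense n_rows x n_rows matrix (assumed layout)."""
    return n_rows * n_rows * bytes_per_entry


def max_batch_rows(n_rows, budget_bytes, bytes_per_entry=1):
    """Largest batch size b such that a b x n_rows block fits in budget_bytes.

    Returns 0 if even a single row of the block does not fit.
    """
    return min(n_rows, budget_bytes // (n_rows * bytes_per_entry))


n = 10_000_000
budget = 16 * 2**30  # hypothetical 16 GiB of device memory

print("dense matrix: %.1f TB" % (dense_pairwise_bytes(n) / 1e12))
print("rows per batch within budget:", max_batch_rows(n, budget))
```

With these assumptions the full matrix for 10 million rows would need about 100 TB, so any approach at this scale has to process the rows in blocks or avoid materializing the matrix at all.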

dantegd commented on May 19, 2024

@cjnolet @teju85 I was wondering if you know the status of this bug currently?

teju85 commented on May 19, 2024

@dantegd Unfortunately, with so many other things happening, this has taken a backseat. Maybe sometime next week I'll try to see if I can make some progress on these dbscan-related bugs.
