Comments (12)
Thanks @MurrayData! There are 2 open PRs (#44 and #47) that make a lot of improvements to dbscan, so we're actively working on solving this soon.
from cuml.
@MurrayData, what version of CUDA are you running?
from cuml.
@teju85 I wonder if this is being caused by the way Cutlass is using TilingStrategy and calculating the resources for the kernels. I was able to get past the error by modifying dbscan.cu to use the tall tiling strategy (503), but now the kernel appears to run indefinitely.
Here's the script I'm using to reproduce this:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN as skDBSCAN
from cuml import DBSCAN as cumlDBSCAN
import cudf
import os
import gzip
def load_data(nrows, ncols, cached = 'data/mortgage.npy.gz'):
    if os.path.exists(cached):
        print('use mortgage data')
        with gzip.open(cached) as f:
            X = np.load(f)
        X = X[np.random.randint(0,X.shape[0]-1,nrows),:ncols]
    else:
        print('use random data')
        X = np.random.rand(nrows,ncols)
    df = pd.DataFrame({'fea%d'%i:X[:,i] for i in range(X.shape[1])})
    return df
nrows = 10000000
ncols = 2
X = load_data(nrows,ncols)
print('data',X.shape)
eps = 3
min_samples = 2
X = cudf.DataFrame.from_pandas(X)
clustering_cuml = cumlDBSCAN(eps = eps, min_samples = min_samples)
clustering_cuml.fit(X)
from cuml.
@cjnolet I'm using 10.0 with the corresponding Docker container: rapidsai/rapidsai:cuda10.0_ubuntu16.04
Apologies for the delay in replying, I was away for most of the weekend.
from cuml.
@cjnolet The tiling strategy only means that we are still using my fork of cutlass, right? I was under the impression that we were going to use the latest version of cutlass and deprecate my fork of cutlass used here. And this was the plan with PRs #44 and #47, right? Or did I get anything wrong here?
from cuml.
@teju85 yes, that is the plan in the very near future (i.e. version 0.5), with #47 being merged pretty soon, and #44 should follow not much thereafter, though I believe there is still a scalability issue in it and Corey is working on plenty of stuff. @cjnolet feel free to correct me :)
@MurrayData we’ll keep you posted (though I’d bet you have already seen that there is a branch called branch-0.5) in case the refactored dbscan is merged to 0.5 before we squash the bug in the current version.
from cuml.
@cjnolet do you know what is the status on this bug in the refactored dbscan of 0.5?
from cuml.
@teju85, have you been able to find a culprit for this issue yet?
I think we should push this to 0.6 and focus on #80.
from cuml.
@cjnolet based on the bugs themselves I don’t see any reason why #80 would be higher priority, since this is a full-blown crash. We can revisit if by the end of the week there is no progress toward a solution due to lack of cycles, but I don’t think we should before then.
from cuml.
#80 is a problem because users are getting incorrect results: in some cases no clusters are found whatsoever when there should be several. This is happening at all scales.
This ticket is related to the algorithm's ability to scale to millions of training examples.
Ideally we would all have the cycles to knock both of these tickets out, however, if I have to choose between the two, I'd personally prioritize correctness over scale.
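For anyone who wants to check the correctness side on a small dataset, here is a rough sketch of how two label arrays (for example, one from the scikit-learn DBSCAN and one from the cuML DBSCAN) could be compared. The helper name `same_partition` is hypothetical and not part of either library. It assumes noise is labeled -1 in both outputs and treats cluster ids as arbitrary, so two results match if they group the points the same way. Note that DBSCAN border points can legitimately end up in different clusters depending on processing order, so a mismatch is a hint to look closer, not proof of a bug.

```python
from typing import Hashable, Sequence


def same_partition(a: Sequence[Hashable], b: Sequence[Hashable], noise=-1) -> bool:
    """Return True if label sequences a and b group the points identically.

    Cluster ids may differ between implementations, so labels are compared
    up to renaming. A point must be noise in both labelings or in neither.
    """
    if len(a) != len(b):
        return False
    a_to_b, b_to_a = {}, {}
    for la, lb in zip(a, b):
        # Noise in one labeling must be noise in the other.
        if (la == noise) != (lb == noise):
            return False
        # Each cluster id must map to exactly one id on the other side.
        if a_to_b.setdefault(la, lb) != lb or b_to_a.setdefault(lb, la) != la:
            return False
    return True


# Illustrative label arrays (made up for this sketch, not real output).
reference = [0, 0, 1, 1, -1]       # e.g. labels from a CPU run
renamed = [5, 5, 2, 2, -1]         # same grouping, different cluster ids
collapsed = [-1, -1, -1, -1, -1]   # the "no clusters found" failure mode

print(same_partition(reference, renamed))
print(same_partition(reference, collapsed))
```

The first comparison passes because only the ids differ; the second fails because every point was marked as noise, which is the kind of symptom described in #80.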
from cuml.
@cjnolet @teju85 I was wondering if you know the status of this bug currently?
from cuml.
@dantegd Unfortunately, with so many other things happening, this has taken a backseat. Maybe sometime next week I'll try to see if I can make some progress on these dbscan-related bugs.
from cuml.