
cuml's People

Contributors

ajschmidt8, akkamesh, beckernick, canonizer, cjnolet, danielhanchen, dantegd, divyegala, garfounkel, gputester, hcho3, jirikraus, johnzed, levsnv, lowener, mdemoret-nv, mike-wendt, nanthini10, nyrio, oyilmaz-nvidia, raydouglass, rietmann-nv, salonijain27, teju85, tfeher, viclafargue, vinaydes, vishalmehta1991, wphicks, yjk21

cuml's Issues

k-NN notebook in CUDA 10 container fails, work-around: rebuild faiss-gpu inside container

Problem: The knn_demo.ipynb included in the CUDA 10 version of the RAPIDS container fails on cell 9 (calling knn_cuml.fit(X)) with the following traceback:

AttributeError Traceback (most recent call last)
in
/conda/envs/rapids/lib/python3.5/site-packages/cuml-0+unknown-py3.5-linux-x86_64.egg/cuml.cpython-35m-x86_64-linux-gnu.so in cuml.KNN.fit()
AttributeError: module 'faiss' has no attribute 'StandardGpuResources'

Work-around:

Here are the steps to take inside the nvcr.io/nvidia/rapidsai/cuda10.0_ubuntu16.04 container:

As the jupyter user inside the container:

source activate rapids
conda uninstall -y faiss-gpu
conda install -y mkl-include=2018.0.3
conda install -y swig=3.0.12
git clone -b v1.4.0 https://github.com/facebookresearch/faiss.git
cd faiss
LDFLAGS="-L${CONDA_PREFIX}/lib" ./configure --prefix=$CONDA_PREFIX --with-python=$(which python)
sed -i 's|PYTHONCFLAGS = -I|PYTHONCFLAGS= -I/conda/envs/rapids/include/python3.5m/ -I/conda/envs/rapids/lib/python3.5/site-packages/numpy/core/include|g' ./makefile.inc
sed -i '/-gencode arch=compute_61,code="compute_61" \\/a -gencode arch=compute_70,code="compute_70" \\' ./makefile.inc
make install
cd gpu
make
make cpu && make gpu

Then, from another shell, perform the following as root in the container:

docker ps                                   # identify the container ID
docker exec -it -u root <container-id> bash
source activate rapids
cd /rapids/notebooks/faiss/python
python setup.py install

Known problem building cuML when cuDF is installed with conda

Building cuML from source works fine in an environment where cuDF was also built from source, and installing both with conda also works well.

This issue refers to problems building cuML from source in a conda environment where cuDF was installed using conda install (it can also happen in non-conda environments). In such an environment, libcuml is still installed to the environment's lib folder, but because cuDF was installed with conda install, cuML's setup.py looks for libcuml in site-packages instead, making the cythonization process fail like this:

$ python setup.py build_ext --inplace
cuML/cuml.pyx: cannot find cimported module 'c_tsvd'
cuML/cuml.pyx: cannot find cimported module 'c_kmeans'
cuML/cuml.pyx: cannot find cimported module 'c_pca'
cuML/cuml.pyx: cannot find cimported module 'c_dbscan'

Currently working on a solution.

[FEA] cuML to expose a "proper" CUDA API

Is your feature request related to a problem? Please describe.
We currently are not exposing the following things from our C/C++ API:

  1. cudaStream
  2. cublasHandle_t and cusolverDnHandle_t
  3. custom memory allocators

The advantages of exposing these are:

  1. performance
  2. tighter control over job scheduling and resource allocation from the wrapping library itself
  3. consistency with other CUDA libraries, which means a smaller ramp-up curve for our users

Describe the solution you'd like
One solution can be to:

  1. expose a cumlHandle_t structure (just like cudnn/cublas/cufft/cusolver).
  2. give users the ability to set and get the above handles/streams/allocators.
  3. make all of cuML's exposed methods accept this object.

Describe alternatives you've considered
There are no alternatives currently.

Additional context
None.

Note
Just like #77 , I'm mostly filing this issue so that it doesn't slip away. Please feel free to set the priority for this accordingly, @datametrician @dantegd .

[QST] Does cuML perform poorly on datasets with many columns?

What is your question?
When I try a dataset with few columns (shape=[10000,1000]) on cuML PCA, it works like a charm and GPU utilization is high.
When I try a dataset with many columns (shape=[10000,10000]) on cuML PCA, it seems like a disaster for the GPU.
Utilization is high at first (hitting 100%), but after about 2 seconds it drops very low (0-3%) and stays there for more than 5 minutes, and it makes no difference whether svd_solver='randomized' or svd_solver='full'. Is that a common situation for cuML? It looks like the GPU is doing nothing after the first 2 seconds, while a single CPU process sits at 100% utilization, busy moving data in and out for more than 5 minutes.
(code below)

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA as skPCA
import os
import multiprocessing
import cudf
import cuml
from cuml import PCA as cumlPCA  

def load_data(nrows, ncols):
    print('use random data')
    X = np.random.rand(nrows,ncols)
    df = pd.DataFrame({'fea%d'%i:X[:,i] for i in range(X.shape[1])})
    return df

nrows = 10000
ncols = 10000
X = load_data(nrows,ncols)

n_components = 2
whiten = False
random_state = 42
# no performance difference between "randomized" and "full" svd_solver
svd_solver="randomized" 
pca_cuml = cumlPCA(n_components=n_components,svd_solver=svd_solver, 
            whiten=whiten, random_state=random_state)
result_cuml = pca_cuml.fit_transform(X)
# it takes about 5.5 minutes on a 2080 Ti GPU.

provided example E2E.ipynb in container fails on dxgb_gpu.train

Failing with error:

IndexError Traceback (most recent call last)
in

/conda/envs/gdf/lib/python3.5/site-packages/dask_xgboost-0.1.5-py3.5.egg/dask_xgboost/core.py in train(client, params, data, labels, dmatrix_kwargs, **kwargs)
229 """
230 return client.sync(_train, client, params, data,
--> 231 labels, dmatrix_kwargs, **kwargs)
232
233

/conda/envs/gdf/lib/python3.5/site-packages/distributed-1.23.3-py3.5.egg/distributed/client.py in sync(self, func, *args, **kwargs)
645 return future
646 else:
--> 647 return sync(self.loop, func, *args, **kwargs)
648
649 def __repr__(self):

/conda/envs/gdf/lib/python3.5/site-packages/distributed-1.23.3-py3.5.egg/distributed/utils.py in sync(loop, func, *args, **kwargs)
275 e.wait(10)
276 if error[0]:
--> 277 six.reraise(*error[0])
278 else:
279 return result[0]

/conda/envs/gdf/lib/python3.5/site-packages/six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None

/conda/envs/gdf/lib/python3.5/site-packages/distributed-1.23.3-py3.5.egg/distributed/utils.py in f()
260 if timeout is not None:
261 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262 result[0] = yield future
263 except Exception as exc:
264 error[0] = sys.exc_info()

/conda/envs/gdf/lib/python3.5/site-packages/tornado/gen.py in run(self)
1131
1132 try:
-> 1133 value = future.result()
1134 except Exception:
1135 self.had_exception = True

/conda/envs/gdf/lib/python3.5/asyncio/futures.py in result(self)
292 self._tb_logger = None
293 if self._exception is not None:
--> 294 raise self._exception
295 return self._result
296

/conda/envs/gdf/lib/python3.5/site-packages/tornado/gen.py in wrapper(*args, **kwargs)
324 try:
325 orig_stack_contexts = stack_context._state.contexts
--> 326 yielded = next(result)
327 if stack_context._state.contexts is not orig_stack_contexts:
328 yielded = _create_future()

/conda/envs/gdf/lib/python3.5/site-packages/dask_xgboost-0.1.5-py3.5.egg/dask_xgboost/core.py in _train(client, params, data, labels, dmatrix_kwargs, **kwargs)
135 label_parts = None
136 if isinstance(data, (list, tuple)):
--> 137 if isinstance(data[0], Delayed):
138 for data_part in data:
139 if not isinstance(data_part, Delayed):

IndexError: list index out of range

[BUG] Kmeans and Tsvd test seem to be sensitive to random number generation

Describe the bug
3 cuML tests seem to be sensitive to random number generation because they are failing with GCC 7.3.0:

[==========] 56 tests from 20 test cases ran. (1502 ms total)
[  PASSED  ] 53 tests.
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] KmeansTests/KmeansTestF.Fit/0, where GetParam() = 16-byte object <02-00 00-00 CD-CC 4C-3D 04-00 00-00 02-00 00-00>
[  FAILED  ] KmeansTests/KmeansTestD.Fit/0, where GetParam() = 24-byte object <02-00 00-00 00-00 00-00 9A-99 99-99 99-99 A9-3F 04-00 00-00 02-00 00-00>
[  FAILED  ] TsvdTests/TsvdTestDataVecF.Result/0, where GetParam() =

With GCC 7.1.1 the tests work fine.

Steps/Code to reproduce bug
Build cuML with GCC 7.3.0 and run ml_test

Expected behavior
All tests are passing

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Linux Distro/Architecture: Ubuntu 18.04.1 LTS amd64
  • GPU Model/Driver: Quadro GV100 and 410.45
  • CUDA: 10.0.130
  • Method of cuML install: from source
  • cmake: 3.13.1
  • gcc/g++: gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0

E2E.ipynb fails with 8 workers on an 8 GPU server

TL;DR: 8 workers fail but 4 workers succeed on an 8 GPU server.

  1. With the full mortgage dataset all dask workers were dying. Out of GPU memory would be the obvious assumption.
  2. 4 partitions/files works. But that means only 4 GPUs on an 8 GPU server.
  3. I then removed the extra _1 files for 2001 and in that case 4 dask workers would die in the ETL.
  4. I reduced each of the 2001 files to 875MB which is about the size of the smaller year 2000 files. The ETL then runs and no workers are lost. But as a check gpu_dfs[0].result() failed with an exception:
Exception: GDF_VALIDITY_UNSUPPORTED.  The traceback refers to an error with everdf.group.max().

In reducing the file I removed all values of a given loan #, never splitting them.
  5. Without checking, the XGBoost DMatrix conversion appears to run with a wall time of 6.18s.
  6. But the GPU XGBoost train fails with the exception and traceback shown below.
TypeError: reraise() missing 2 required positional arguments: 'tp' and 'value'

Multiple alignment conflict warnings in kmeans during compilation

In the current version of kmeans, there are multiple alignment conflicts that cause warnings of the following style to be raised during compilation:

warning: specified alignment (4) is different from alignment (8) specified on a previous declaration
          detected during instantiation of "void kmeans::detail::matmul(const float_t *, const float_t *, float_t *, float_t, float_t, int, int, int, int) [with float_t=float]"

Besides the warnings, there is the potential that this might cause problems in the future so it is worth looking into the conflicts in the kmeans code.

[DOC] Add doc entry in README to create Eclipse NSight project inside of cuml c++

While setting up my workstation to debug through CUDA code, I found a script in [1] that has proven really useful for creating an eclipse project file. Before finding this script, I made several unsuccessful attempts at creating projects that either wouldn't build properly, wouldn't analyze/index the code properly, or wouldn't run/debug. Building the eclipse project file from the cmake command itself worked.

It would be very useful to the community if we included this command in our documentation to enable more potential contributors to cuML. NSight already comes with the CUDA toolkit, thus we can assume any developers wanting to build our repository already have it installed.

[1] https://github.com/rickyzhang82/cs344/blob/master/auto-generate-project.sh

DBSCAN produces different number of clusters using cuML compared to sklearn

DBSCAN generates different # of clusters when using cuML compared to when using sklearn.

Dataset to reproduce:
https://github.com/PatWalters/gpu_kmeans/blob/master/fp.csv

Code to reproduce:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN as skDBSCAN
from cuml import DBSCAN as cumlDBSCAN
import cudf
import os

X = pd.read_csv("fp.csv")
print('data',X.shape)

eps = 3
min_samples = 2

clustering_sk = skDBSCAN(eps = eps, min_samples = min_samples)
clustering_sk.fit(X)
print("# of sklearn clusters", len(set(clustering_sk.labels_)))

X = cudf.DataFrame.from_pandas(X)
clustering_cuml = cumlDBSCAN(eps = eps, min_samples = min_samples)
clustering_cuml.fit(X)
print("# of cuML clusters", clustering_cuml.labels_.unique_count())

Document datatype detection (Was: Dataframes with dtype = np.float64 give incorrect results)

I noticed this when working with a Dataframe that has columns of type np.float64. It looks like the wrappers underneath expect a single precision float * and there's no explicit casting going on.

As a result, the calculations resulting from the c code are incorrect because the pointers are being treated as single precision. This isn't necessarily a bug but it should be documented somewhere so that users know to expect this behavior. Otherwise, it could cause some headaches and slow adoption.
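In the meantime, a defensive pattern is to downcast float64 data to float32 before handing it to cuML. A minimal sketch of that workaround, assuming a pandas DataFrame as the starting point:

import numpy as np
import pandas as pd
import cudf

# Build a frame that defaults to float64, then downcast so the
# single-precision kernels underneath see the float* layout they expect.
pdf = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})
pdf = pdf.astype(np.float32)

gdf = cudf.DataFrame.from_pandas(pdf)
print(gdf.dtypes)  # every column should now be float32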

cuml KMeans randomly terminates with 'thrust::system::system_error' on larger datasets/no of clusters

I pulled the new Rapids Docker container particularly to re-run a KMeans exercise on Twitter location data that I've previously run successfully in both TensorFlow and Scikit-Learn.

from cuml import KMeans as km
import cudf

names = ['0','1']
dtypes = ['float64','float64']
filename = "/data/twitter/cluster_points.csv"

clustering_cuml = km(n_clusters=100)
clustering_cuml.fit(gdf)

The gdf looks like this:

                    0                  1
 0          392159.91           223933.2
 1          434359.54          278703.86
 2  436988.1599999999          335566.98
 3 386173.63999999996 349452.80000000005
 4          275936.06  674298.0899999999
 5          432248.25          444924.63
 6 458423.64999999997  304714.0300000001
 7          591923.55          120227.19
 8          532864.35          182221.79
 9 336145.64999999997          390272.31
[10023370 more rows]

And in the Docker container it hangs. I noticed the following by monitoring the system in another window:

  1. The GPU is not utilised, it remains at 0% in nvidia-smi
  2. A single core of the CPU runs at 100% for quite some time and then the process stops but the notebook doesn't detect this.
  3. The GPU RAM is freed upon termination, including the dataframe.

The log shows:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  trivial_device_copy D->H failed: invalid argument

I then tried using a smaller sample of just 10,000 records and the same thing happened.

I reduced the number of clusters to 10 and it worked fine. I then steadily increased the number of records back up to 10 million and found it would intermittently hang on some runs above 12 clusters, but then work other times. I then increased the number of clusters and found I could never get it to work beyond 20 clusters.

It's a difficult error to reproduce, as it appears random, although it happens consistently on larger datasets. I always shut down all kernels and restart each time for a clean environment.

I've also tried it on different computers with different GPUs: GV100, Titan V and Titan Xp and experienced the same issue.

I also tried it outside Docker and the same thing happened.

kNN to support all FAISS GPU indexes

Currently, the kNN implementation within cuml's python layer only supports the IndexFlatL2 index. It would be ideal if cuml could somehow support the other index types without tying the user-facing API too closely to FAISS.

One way to implement this might be for the API layer to provide a pluggable strategy for "index_alg" that would call the necessary index function in FAISS when invoked.
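A rough sketch of what such a pluggable strategy could look like from the Python side; the index_alg names and the mapping below are illustrative only, not an existing cuML API:

import faiss

# Hypothetical mapping from a user-facing "index_alg" string to the
# FAISS GPU index the wrapper would construct internally.
_INDEX_BUILDERS = {
    'flat_l2': lambda res, d: faiss.GpuIndexFlatL2(res, d),
    'flat_ip': lambda res, d: faiss.GpuIndexFlatIP(res, d),
    'ivf_flat': lambda res, d: faiss.GpuIndexIVFFlat(res, d, 1024, faiss.METRIC_L2),
}

def build_index(index_alg, n_dims):
    # Select and construct the requested FAISS GPU index.
    res = faiss.StandardGpuResources()
    if index_alg not in _INDEX_BUILDERS:
        raise ValueError("unsupported index_alg: %s" % index_alg)
    return _INDEX_BUILDERS[index_alg](res, n_dims)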

Support builds for Turing architecture

@dantegd this is probably a continuation of the work after the build-related changes are merged from PR #42 . If we detect CUDA v10.0, we should support building for the Turing architecture as well. Filing this issue so that it doesn't slip through.

[DISCUSS] Remove gtest from external/ and depend on shared object

GTest is a library that is available in standard repositories in common Linux package managers.

Currently, the cuml codebase ships with the gtest code included in the external/ directory but it might be easier for users if we follow the path we took for the FAISS integration and simply make it a dependency for our build. In the case of cuML being packaged up for install with aptitude or yum, the gtest dependency would be installed automatically.

I opened this ticket to discuss our options and bring to light any possible reasons why removing gtest from external would be a bad idea for sustainability.

cuml dbscan terminating on large datasets 'invalid configuration argument'

I have an issue with DBSCAN terminating on large datasets. I'm running the latest NGC Rapids Docker container. I've seen the comments in #31.

[I 14:36:16.181 LabApp] Kernel started: d712459f-cf0a-4c0e-a6ec-ecc73b9e91f1
[I 14:36:16.667 LabApp] Adapting to protocol v5.1 for kernel d712459f-cf0a-4c0e-a6ec-ecc73b9e91f1
[I 14:37:01.848 LabApp] Saving file at /cuml/dbscan_twitter.ipynb
terminate called after throwing an instance of 'std::runtime_error'
  what():  Exception occured! file=/rapids/cuml/cuML/src/dbscan/vertexdeg/algo5.h line=141: FAIL: call='res.result'. Reason:invalid configuration argument
[I 14:38:40.180 LabApp] KernelRestarter: restarting kernel (1/5), keep random ports
kernel d712459f-cf0a-4c0e-a6ec-ecc73b9e91f1 restarted

I'm using the same Twitter-derived point dataset as in #53 and sampling it down until it works. The source data has 10.5 million points. The crashes happen above 5 million rows x 2 columns; row counts below that work fine. Minimum sample size was set to 1000 and eps to none.

In addition to using the Twitter file, I've also tested it with randomly generated data.

                    0                  1
 0 478401.76952889207  542950.5525448014
 1  454622.9484872194  463117.5441902199
 2 340651.60100943915   568573.833436874
 3  60462.91779449762  248186.3741022621
 4 290905.83582845033  564827.5589121555
 5  305875.7357089389  187773.6372960709
 6 122647.20323430178 444709.74442503956
 7 11336.977964928763 382183.34051422554
 8 22657.665326672173  527002.8538174401
 9 106190.10338763308 428359.52118789754
[9999990 more rows]

[CUML] Real-world example notebooks

I have built a few real-world examples for a talk on cuML. The notebooks that I created for the talk will be useful to the community. I will submit a PR for this.

dbscan crashed when the data set grow large

GPU used: 1u Tesla P40;
OS and version: Ubuntu 18.04;
CUDA version: 9.2;
Driver: 410.48;
gcc version: 7.3;
python version: 3.5

I have a data set containing 180,914 rows and 48 columns; most values are integers (from 0 to 10,000):
[pic1: screenshot of the data set]

I converted the full data frame to the float64 data type.
It works well when using the sklearn (CPU) library.
When I tried to run the DBSCAN from cuML, it crashed with no response at all, and I had to restart the whole kernel.
I then reduced the rows in the data set from 180K to 10K; it works but is very slow, taking about 3 s, then 6 s for 20K rows and 9 s for 30K rows, and it crashes once the data reaches 70K rows.

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN as skDBSCAN
from sklearn.datasets.samples_generator import make_blobs
from cuml import DBSCAN as cumlDBSCAN
import pygdf
import os
import dask
#import dask_gdf
from dask.delayed import delayed
from dask.distributed import Client, wait
from pygdf.dataframe import DataFrame
from collections import OrderedDict
from glob import glob
from sklearn.cluster import KMeans
import re
from itertools import cycle
from sklearn.preprocessing import StandardScaler

##LOAD DATA
X = load_data()

eps = 0.5
min_samples = 3

run using sklearn lib

clustering_sk = skDBSCAN(eps = eps, min_samples = min_samples)
clustering_sk.fit(X)

run using cuML lib

Y = pygdf.DataFrame.from_pandas(X)

clustering_cuml = cumlDBSCAN(eps = eps, min_samples = min_samples)
Z = Y.head(70000)
clustering_cuml.fit(Z)

[TASK] Update CUTLASS in cuml

The last commit to CUTLASS in cuML's ml-prims/external/cutlass submodule was from June 2018.

The CUTLASS repository has commits from Dec 19, 2018.

It would be a good idea to update this.

Conda install cuml failure

Installing cuML from rapidsai conda failed due to two packages not being found. I am able to install cuDF successfully from conda, though. If I try to do conda install faiss-gpu on its own, I am able to install it into the environment.

Ubuntu 16.04, V100, CUDA 9.2, driver 410.48

Full error:

(cudf) root@81afa5caf852:/# conda install -c rapidsai cuml 
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - cuml
  - faiss-gpu

Current channels:

  - https://conda.anaconda.org/rapidsai/linux-64
  - https://conda.anaconda.org/rapidsai/noarch
  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/free/linux-64
  - https://repo.anaconda.com/pkgs/free/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch
  - https://repo.anaconda.com/pkgs/pro/linux-64
  - https://repo.anaconda.com/pkgs/pro/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

Multi-GPU PCA Support

I know this might be an ambitious effort, but would the RAPIDS development team be able to provide support for multi-GPU PCA? Many of the datasets I work with are 30GB+. Being able to reduce the dimensionality of these datasets to more manageable sizes would be useful not only for lowering computational expense, but perhaps even more so for sharing/collaboration.

As a test case, I am currently working with HCP (Human Connectome Project) data. Here, I use NumPy to manipulate my data because I am a decent human being. But after my data is all nice and beautiful, I am faced with "Out of Memory" when I try to convert my NumPy array to a Pandas dataframe and then to a cuDF dataframe.

The shape of my data I am feeding into PCA here is (120, 63070800). The time to beat is shown below. My colleagues and I have struggled with this PCA problem on different datasets for almost two years now so if you can do us a solid and end our misery a little bit faster, we would be extremely grateful. Thanks again RAPIDS team. You are killing it.

[pca_time: screenshot of the timing to beat]

Unable to run python unit-tests

I'm at the HEAD of this repository. When I try to run the unit tests, I get the error attached at the end. Command to repro:

$ cd python/cuML/test
$ pytest test_pca.py

run.log

[TASK] [DEBT] DBSCAN C++ needs tests for each component

Build gtests for vertex degree, adjacency graph, and labeling components within DBSCAN to aid in debugging performance and correctness problems.

There are naive kernel implementations of each of these components that should provide good baselines. This needs to be verified as well, however.

[FEA] cuML to expose a "proper" C-API

Is your feature request related to a problem? Please describe.
Currently, we directly expose the underlying C++ implementations via Cython. Since Cython can also understand C++ interfaces, all is well for us. But if we want wider adoption, I think we should expose a true C API (e.g. declaring symbols under extern "C"). Such an interface can then be easily used across multiple languages.

Describe the solution you'd like
As a first step, we could just start by wrapping our *_c.h files under each algo folder of cuML with extern "C" declarations.

Describe alternatives you've considered
-NA-

Additional context
None.

[FEA] cuML/cuDF support for DLPack

Should cuML/cuDF also support DLPack?
Following another thread on pytorch made me think about this. pytorch/pytorch#15601

I'm a fan of the numba array_interface, but supporting multiple integrations may make it easier to consolidate in the future.
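As a rough illustration of what a DLPack-based handoff could look like: the to_dlpack()/from_dlpack() calls on the cuDF side below are hypothetical, while torch.utils.dlpack is PyTorch's existing half of the exchange.

import cudf
from torch.utils import dlpack as torch_dlpack

gdf = cudf.DataFrame()
gdf['x'] = [1.0, 2.0, 3.0]
gdf['y'] = [4.0, 5.0, 6.0]

# cuDF -> PyTorch, staying on the device instead of round-tripping through host memory
capsule = gdf.to_dlpack()                    # hypothetical export
tensor = torch_dlpack.from_dlpack(capsule)

# PyTorch -> cuDF
gdf2 = cudf.from_dlpack(torch_dlpack.to_dlpack(tensor * 2))   # hypothetical import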

KMeans hangs/core dumps with integer type data

KMeans appears to only work with floats, currently. This may be known, as the docstring explicitly calls out Kmeans for floats in the naming conventions. We should update the documentation in the short term to reflect this explicitly if it's known behavior.

The example in the docstring works:

from cuml import KMeans
import cudf
import numpy as np
import pandas as pd

def np2cudf(df):
	# convert numpy array to cuDF dataframe
	df = pd.DataFrame({'fea%d'%i:df[:,i] for i in range(df.shape[1])})
	pdf = cudf.DataFrame()
	for c,column in enumerate(df):
		pdf[str(c)] = df[column]
	return pdf


a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],dtype=np.float32)
b = np2cudf(a)
print("input:")
print(b)

print("Calling fit")
kmeans_float = KMeans(n_clusters=2, n_gpu=-1)
kmeans_float.fit(b)

But the following examples either cause core dumps or hang:

from cuml import KMeans
import cudf
import numpy as np
import pandas as pd

def np2cudf(df):
	# convert numpy array to cuDF dataframe
	df = pd.DataFrame({'fea%d'%i:df[:,i] for i in range(df.shape[1])})
	pdf = cudf.DataFrame()
	for c,column in enumerate(df):
		pdf[str(c)] = df[column]
	return pdf


a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],dtype=np.int32)
b = np2cudf(a)
print("input:")
print(b)

print("Calling fit")
kmeans_float = KMeans(n_clusters=2, n_gpu=-1)
kmeans_float.fit(b)

from cuml import KMeans
import cudf

cdf = cudf.DataFrame()

cdf['a'] = [1,2,3]
cdf['b'] = [6,1,2]
cdf['c'] = [1,2,4]
cdf['d'] = [9,2,100]

kmeans_float = KMeans(n_clusters=2, n_gpu=-1)
kmeans_float.fit(cdf)
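Until integer input is either supported or rejected with a clear error, a minimal workaround sketch is to cast the integer columns to float32 before calling fit (assuming the values fit in float32 without loss):

from cuml import KMeans
import cudf
import numpy as np

cdf = cudf.DataFrame()
cdf['a'] = [1, 2, 3]
cdf['b'] = [6, 1, 2]
cdf['c'] = [1, 2, 4]
cdf['d'] = [9, 2, 100]

# Cast every integer column to float32 before handing the frame to KMeans.
for col in cdf.columns:
    cdf[col] = cdf[col].astype(np.float32)

kmeans_float = KMeans(n_clusters=2, n_gpu=-1)
kmeans_float.fit(cdf)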

System Info

Ubuntu 16.04, Cuda 9.2, 410.48, V100

fit or fit_transform

pca_demo.ipynb and tsvd_demo.ipynb

%%time
pca_sk = skPCA(n_components=n_components,svd_solver=svd_solver,
whiten=whiten, random_state=random_state)
result_sk = pca_sk.fit_transform(X)

%%time
algorithm='arpack'
tsvd_sk = skTSVD(n_components=n_components,algorithm=algorithm,
random_state=random_state)
result_sk = tsvd_sk.fit_transform(X)

I found that using 'fit_transform' will not report an error, but using 'fit' will.
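For reference, a minimal sketch of the corresponding cuML calls being compared (reusing X and the parameters defined above in the notebooks):

from cuml import PCA as cumlPCA

pca_cuml = cumlPCA(n_components=n_components, svd_solver=svd_solver,
                   whiten=whiten, random_state=random_state)

# single fit_transform call
result_fit_transform = pca_cuml.fit_transform(X)

# fit followed by a separate transform
pca_cuml.fit(X)
result_fit = pca_cuml.transform(X)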

cuml Kmeans clustering unexpected behavior not consistent with Sklearn / R's Kmeans

cuML KMeans results in unexpected clustering behavior compared to sklearn and R's stats package. A basic reproducible example is below. The result appears to be due to fundamentally different clusters, not a mis-assignment of records to cluster IDs.

System Info: Ubuntu 16.04, CUDA 9.2, driver 410.48, V100 GPU, Python 3.5.6

"""KMeans testing cuml vs sklearn
"""

from cuml import KMeans as cumlKMeans
from sklearn.cluster import KMeans
import cudf
import numpy as np
import pandas as pd


cdf = cudf.DataFrame()

cdf['a'] = np.array([3000, 2, 3100.], dtype=np.float32)
cdf['b'] = np.array([3000, 4, 3100.], dtype=np.float32)
cdf['c'] = np.array([4000, 3, 4100.], dtype=np.float32)
cdf['d'] = np.array([3100, 1, 4100.], dtype=np.float32)



cuml_km = cumlKMeans(n_clusters=2)
sk_km = KMeans(n_clusters=2)


# cuML Kmeans results in cluster centers of 
# 0    4.0 3100.0 4000.0    3.0
# 1 3550.0 1551.0 1550.5 3550.0
cuml_km.fit(cdf)
print(cuml_km.cluster_centers_)


# sklearn Kmeans results in cluster centers of 
# 0  3050.0  3050.0  4050.0  3600.0
# 1     2.0     4.0     3.0     1.0
sk_km.fit(cdf.to_pandas())
print(pd.DataFrame(sk_km.cluster_centers_))

The sklearn results are consistent with the results from R's stats package implementation of Kmeans. These results are also consistent across multiple runs with different random seeds.

"""KMeans testing cuml vs sklearn
"""

from cuml import KMeans as cumlKMeans
from sklearn.cluster import KMeans
import cudf
import numpy as np
import pandas as pd


cdf = cudf.DataFrame()

cdf['a'] = np.array([3000, 2, 3100.], dtype=np.float32)
cdf['b'] = np.array([3000, 4, 3100.], dtype=np.float32)
cdf['c'] = np.array([4000, 3, 4100.], dtype=np.float32)
cdf['d'] = np.array([3100, 1, 4100.], dtype=np.float32)



km = cumlKMeans(n_clusters=2)
sk_km = KMeans(n_clusters=2)



cuml_results = []
cuml_centers = []
for i in range(100):
	km = cumlKMeans(n_clusters=2, random_state=np.random.choice(100000))
	res = km.fit_predict(cdf)
	cuml_results.append(res.to_array())
	cuml_centers.append(km.cluster_centers_.to_pandas().mean(axis=1))

print(cuml_centers)


cuml_res_df = pd.DataFrame(cuml_results, columns=['c1', 'c2', 'c3'])
print(cuml_res_df[
	(cuml_res_df.c1 == 1) & (cuml_res_df.c2 == 1)
	| (cuml_res_df.c1 == 0) & (cuml_res_df.c2 == 0)
	].shape)


sklearn_results = []
for i in range(1000):
	res = sk_km.fit_predict(cdf.to_pandas())
	sklearn_results.append(res)

sk_res_df = pd.DataFrame(sklearn_results, columns=['c1', 'c2', 'c3'])
print(sk_res_df[
	(sk_res_df.c1 == 1) & (sk_res_df.c2 == 1)
	| (sk_res_df.c1 == 0) & (sk_res_df.c2 == 0)
	].shape)

About 25-40% of the cuml kmeans runs result in records 0 and 1 being in the same cluster, which should not happen or should be vanishingly unlikely (I haven't done the math to see if this is actually a possible stable outcome). None of the sklearn runs result in this pairing.

R example, that matches sklearn:

library(stats)
library(dplyr)

df <- tibble(
  temp = c(3000, 2, 3100.),
  temp2 = c(3000, 4, 3100.),
  temp3 = c(4000, 3, 4100.),
  temp4 = c(3100, 1, 4100.)
)

cl = kmeans(df, 2)
print(cl$centers)
# 1    2     4     3     1
# 2 3050  3050  4050  3600

Failed to install cuml

Why does cuML always fail to install when I execute "make -j"?
The error message is as follows:
31 errors detected in the compilation of "/tmp/tmpxft_000022ed_00000000-15_pca_test.compute_61.cpp1.ii".
CMakeFiles/ml_test.dir/build.make:62: recipe for target 'CMakeFiles/ml_test.dir/test/pca_test.cu.o' failed
make[2]: *** [CMakeFiles/ml_test.dir/test/pca_test.cu.o] Error 2
5 errors detected in the compilation of "/tmp/tmpxft_000022f0_00000000-15_dbscan_test.compute_61.cpp1.ii".
CMakeFiles/ml_test.dir/build.make:88: recipe for target 'CMakeFiles/ml_test.dir/test/dbscan_test.cu.o' failed
make[2]: *** [CMakeFiles/ml_test.dir/test/dbscan_test.cu.o] Error 2
CMakeFiles/Makefile2:73: recipe for target 'CMakeFiles/ml_test.dir/all' failed
make[1]: *** [CMakeFiles/ml_test.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2.

Building cuML fails if you don't install the RIGHT ZLIB

cuml/cuML/CMakeLists.txt requires that both pthread and z are installed by the OS package manager. There are cmake modules for FindThreads and FindZLIB which can be used to discover the location of the libraries in question, so that they can be linked more generally.

Currently, if a user installed zlib via conda install -c conda-forge zlib, libcuml.so will fail to link because it expects libz.so to be in /usr/local/lib.

https://github.com/rapidsai/cuml/blob/master/cuML/CMakeLists.txt#L123-L130

target_link_libraries(cuml
                      OpenMP::OpenMP_CXX
                      ${CUDA_cublas_LIBRARY}
                      ${CUDA_curand_LIBRARY}
                      ${CUDA_cusolver_LIBRARY}
                      ${CUDA_CUDART_LIBRARY}
                      pthread
                      z)

Cannot import cuml installed using conda because `libNVStrings.so` not found [BUG]

Describe the bug
A clear and concise description of what the bug is.

Steps/Code to reproduce bug

import cudf 

Expected behavior
The sample notebook should run as indicated by the docs. At least the import should work.

Environment details (please complete the following information):

  • Environment location: [aws]
  • Linux Distro/Architecture: [Ubuntu 16.04 amd64]
  • GPU Model/Driver: [tesla k80 and driver 396.44]
  • CUDA: [9.2]
  • Method of cuDF & cuML install: [conda]
# packages in environment at /env/python-custom:
#
# Name                    Version                   Build  Channel
arrow-cpp                 0.10.0           py36h70250a7_0    conda-forge
blas                      1.0                         mkl  
boost-cpp                 1.67.0               h3a22d5f_0    conda-forge
bzip2                     1.0.6                h470a237_2    conda-forge
ca-certificates           2018.11.29           ha4d7672_0    conda-forge
certifi                   2018.11.29            py36_1000    conda-forge
cffi                      1.11.5           py36h5e8e0c9_1    conda-forge
cudf                      0.4.0                    py36_0    rapidsai
cuml                      0.4.0            cuda9.2_py36_0    rapidsai
cython                    0.28.5           py36hfc679d8_0    conda-forge
faiss-gpu                 1.4.0           py36_cuda8.0.61_1    pytorch
icu                       58.2                 hfc679d8_0    conda-forge
intel-openmp              2019.1                      144  
libcudf                   0.4.0                 cuda9.2_0    rapidsai
libcudf_cffi              0.4.0            cuda9.2_py36_0    rapidsai
libcuml                   0.4.0                 cuda9.2_0    rapidsai
libffi                    3.2.1                hfc679d8_5    conda-forge
libgcc                    7.2.0                h69d50b8_2    conda-forge
libgcc-ng                 7.2.0                hdf63c60_3    conda-forge
libgfortran-ng            7.2.0                hdf63c60_3    conda-forge
libstdcxx-ng              7.2.0                hdf63c60_3    conda-forge
llvmlite                  0.26.0           py36hd28b015_0    conda-forge
mkl                       2018.0.3                      1  
mkl_fft                   1.0.10                   py36_0    conda-forge
mkl_random                1.0.2                    py36_0    conda-forge
ncurses                   6.1                  hfc679d8_2    conda-forge
numba                     0.41.0           py36hf8a1672_0    conda-forge
numpy                     1.15.0           py36h1b885b7_0  
numpy-base                1.15.0           py36h3dfced4_0  
nvstrings                 0.2.0            cuda9.2_py36_0    nvidia
openssl                   1.0.2p               h470a237_2    conda-forge
pandas                    0.20.3                   py36_1    conda-forge
parquet-cpp               1.5.0.pre            h83d4a3d_0    conda-forge
pip                       18.1                  py36_1000    conda-forge
pyarrow                   0.10.0           py36hfc679d8_0    conda-forge
pycparser                 2.19                       py_0    conda-forge
python                    3.6.7                h5001a0f_1    conda-forge
python-dateutil           2.7.5                      py_0    conda-forge
pytz                      2018.9                     py_0    conda-forge
readline                  7.0                  haf1bffa_1    conda-forge
setuptools                40.6.3                   py36_0    conda-forge
six                       1.12.0                py36_1000    conda-forge
sqlite                    3.26.0               hb1c47c0_0    conda-forge
tk                        8.6.9                ha92aebf_0    conda-forge
wheel                     0.32.3                   py36_0    conda-forge
xz                        5.2.4                h470a237_1    conda-forge
zlib                      1.2.11               h470a237_4    conda-forge

Additional context

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/env/python-custom/lib/python3.6/site-packages/cudf/__init__.py", line 2, in <module>
    from cudf import dataframe             # noqa: F401
  File "/env/python-custom/lib/python3.6/site-packages/cudf/dataframe/__init__.py", line 1, in <module>
    from cudf.dataframe import (buffer, dataframe, series,  # noqa: F401
  File "/env/python-custom/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 18, in <module>
    from cudf import formatting, _gdf
  File "/env/python-custom/lib/python3.6/site-packages/cudf/_gdf.py", line 13, in <module>
    from libgdf_cffi import ffi, libgdf
  File "/env/python-custom/lib/python3.6/site-packages/libgdf_cffi/__init__.py", line 30, in <module>
    libgdf_api = ffi.dlopen(_get_lib_name())
OSError: cannot load library 'libcudf.so': libNVStrings.so: cannot open shared object file: No such file or directory

[FEA] Removing "randomized" solver option from PCA and tSVD

Is your feature request related to a problem? Please describe.
The randomized solver option is not supported due to its high discrepancy. It was added only to be compatible with scikit-learn.

Describe the solution you'd like
It should be removed from the Python wrapper.
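A minimal sketch of how the wrapper could reject the option once it is removed (the parameter check below is illustrative, not the actual cuML code):

_SUPPORTED_SOLVERS = ('auto', 'full', 'jacobi')

def _check_svd_solver(svd_solver):
    # Fail fast with a clear message instead of silently accepting a
    # solver that is no longer backed by the C++ layer.
    if svd_solver == 'randomized':
        raise ValueError("svd_solver='randomized' is no longer supported; "
                         "use one of %s" % (_SUPPORTED_SOLVERS,))
    if svd_solver not in _SUPPORTED_SOLVERS:
        raise ValueError("unknown svd_solver: %r" % svd_solver)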

[TASK] [DEBT] Use GLOG in the cuML's back-end

I have spoken with some of the other back-end developers of cuML about this. GLOG is Google's logging tool for C++.

Similar to other logging tools, it allows us to set a log level at run-time in order to debug cuML algorithms without having to rebuild each time.

It will also allow users of the community to drop the log level to debug when they experience problems and provide their output on github issues so we can help isolate problems more easily.

[TASK] Add license header check to build

As both the code base and community in cuML (and RAPIDS AI in general) continue to grow, it would be useful to add a license header check to our build.

Most often, tools that do this will allow you to white-list a set of file extensions that will be checked.

From a cursory look through the code base, we will definitely want to check the extensions: ["py", "cu", "c", "h", "cfg", "sh"]

It would also make sense to blacklist directories (e.g. external/).

Perhaps it could be as simple as having a python script that runs (and places headers in the appropriate format with the appropriate extensions).

There are tools out there to help with this too [1]. We would need to decide whether we want our tool to change the files or just alert when files exist without license headers. If we go the latter route, we should probably add this to our travis build to make it easier for contributors (and reviewers).

[1] https://github.com/malukenho/docheader
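As a starting point, a minimal check-only sketch of such a script (the extensions, excluded directories, and header marker below are placeholders to be adjusted):

import os
import sys

CHECK_EXTENSIONS = ('.py', '.cu', '.c', '.h', '.cfg', '.sh')
EXCLUDED_DIRS = {'external'}
HEADER_MARKER = 'Copyright (c)'   # placeholder for the real license text

def missing_headers(root):
    # Walk the tree, skip blacklisted directories, and collect files whose
    # first few lines do not contain the license marker.
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in filenames:
            if not name.endswith(CHECK_EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors='ignore') as f:
                head = ''.join(f.readline() for _ in range(5))
            if HEADER_MARKER not in head:
                missing.append(path)
    return missing

if __name__ == '__main__':
    bad = missing_headers(sys.argv[1] if len(sys.argv) > 1 else '.')
    for path in bad:
        print('missing license header:', path)
    sys.exit(1 if bad else 0)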
