oap-mllib's Introduction

* LEGAL NOTICE: Your use of this software and any required dependent software (the "Software Package") is subject to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party or open source software included in or with the Software Package, and your use indicates your acceptance of all such terms. Please refer to the "TPP.txt" or other similarly-named text file included with the Software Package for additional details.
* Optimized Analytics Package for Spark* Platform is under Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0).

Introduction

The Problem

Apache Spark MLlib is a scalable machine learning library built on the Spark unified analytics platform. It integrates seamlessly with Spark SQL, Spark Streaming and other machine learning and deep learning frameworks, so the entire pipeline can be built without additional glue code.

However, the JVM-based MLlib makes only limited use of BLAS acceleration, and Spark shuffle is a slow communication channel during distributed training. Spark's original design is also CPU-centric and cannot leverage GPU acceleration, so MLlib does not fully utilize modern CPU and GPU capabilities to achieve the best performance.

OAP MLlib Solution

OAP MLlib is a platform-optimized package that accelerates machine learning algorithms in Apache Spark MLlib. It is compatible with Spark MLlib and leverages the open source Intel® oneAPI Data Analytics Library (oneDAL) to provide highly optimized algorithms and get the most out of CPU and GPU capabilities. It also takes advantage of the open source Intel® oneAPI Collective Communications Library (oneCCL) to provide efficient communication patterns in multi-node, multi-GPU clusters.

Who will use OAP MLlib

This solution is intended for researchers, data scientists and enterprise users who want to accelerate their Spark MLlib algorithms with minimal configuration changes.

Architecture

The following diagram shows the high-level architecture of OAP MLlib.

[Diagram: OAP MLlib Architecture]

OAP MLlib keeps the same API interfaces as Spark MLlib. Both Python and Scala are supported. This means applications built with Spark MLlib can run directly with minimal configuration changes.
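
For instance, a minimal PySpark sketch (the dataset path is illustrative) runs unchanged whether or not OAP MLlib is on the class path; with OAP MLlib enabled, the fit is accelerated transparently:

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("oap-mllib-kmeans").getOrCreate()

    # hypothetical dataset in libsvm format
    data = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")

    # standard Spark MLlib API; accelerated by oneDAL when OAP MLlib is enabled
    model = KMeans().setK(2).setSeed(1).fit(data)
    print(model.clusterCenters())

    spark.stop()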

Most of the algorithms produce results identical to Spark MLlib's. However, due to the nature of distributed floating-point operations, there may be small deviations from the original results; we make sure the error stays within an acceptable range and that accuracy is on par with Spark MLlib.

For algorithms that are not accelerated by OAP MLlib, the original Spark MLlib implementation is used.

Online Documentation

You can find all the OAP MLlib documentation on the project web page.

Getting Started

Python/PySpark Users Preferred

Use a pre-built JAR to get started. If you have finished the OAP Installation Guide, you can find the compiled OAP MLlib JAR oap-mllib-x.x.x.jar in $HOME/miniconda2/envs/oapenv/oap_jars/.

Then refer to the Running section below to try it out.

Java/Scala Users Preferred

Use a pre-built OAP MLlib JAR to get started; you can download the OAP MLlib JAR from the Release Page.

Then refer to the Running section below to try it out.

Building From Scratch

You can also build the package from source code; please refer to the Building Code section.

Running

Prerequisites

  • Generally, our common system requirements are the same as those of the Intel® oneAPI Toolkit; please refer to Intel® oneAPI Base Toolkit System Requirements for details.

  • Please follow this guide to install the Intel® oneAPI Runtime Library Packages using package managers. The following runtime packages, with all their dependencies, should be installed on all cluster nodes (see the install example after this list):

    intel-oneapi-runtime-dal
    intel-oneapi-runtime-ccl
    
  • (Optional) If you plan to use Intel GPU, install the Intel GPU drivers. Otherwise only CPU is supported.
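
For example, on Debian/Ubuntu nodes the two runtime packages listed above could be installed with a command along these lines (this assumes the Intel oneAPI APT repository has already been added as described in the guide):

    $ sudo apt-get install intel-oneapi-runtime-dal intel-oneapi-runtime-ccl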

Supported Spark Versions

  • Apache Spark 3.1.1
  • Apache Spark 3.1.2
  • Apache Spark 3.1.3
  • Apache Spark 3.2.0
  • Apache Spark 3.2.1
  • Apache Spark 3.2.2
  • Apache Spark 3.3.3

Supported Intel® oneAPI Toolkits

  • Intel® oneAPI Toolkits >= 2023.1

Spark Configuration

General Configuration

Standalone Cluster Manager

When using the standalone cluster manager, you need to upload the JAR to every node or place it in a shared network folder, and then specify absolute paths for extraClassPath:

# absolute path of the jar for driver class path
spark.driver.extraClassPath       /path/to/oap-mllib-x.x.x.jar
# absolute path of the jar for executor class path
spark.executor.extraClassPath     /path/to/oap-mllib-x.x.x.jar

YARN Cluster Manager

For users running Spark applications on YARN in client mode, you only need to add the following configuration to spark-defaults.conf or to the spark-submit command line before running.

# absolute path of the jar for uploading
spark.files                       /path/to/oap-mllib-x.x.x.jar
# absolute path of the jar for driver class path
spark.driver.extraClassPath       /path/to/oap-mllib-x.x.x.jar
# relative path of the jar for executor class path
spark.executor.extraClassPath     ./oap-mllib-x.x.x.jar
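
As a sketch, the same settings can also be passed on the spark-submit command line (the application script name is hypothetical):

    $ spark-submit --master yarn --deploy-mode client \
        --files /path/to/oap-mllib-x.x.x.jar \
        --driver-class-path /path/to/oap-mllib-x.x.x.jar \
        --conf spark.executor.extraClassPath=./oap-mllib-x.x.x.jar \
        your-mllib-app.py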

Note: Intel GPUs are not fully supported with the YARN cluster manager; please use standalone mode.

OAP MLlib Specific Configuration

spark.oap.mllib.device selects the compute device; it can be set to CPU or GPU, and defaults to CPU if not specified. Please check the List of Accelerated Algorithms section for the algorithms supported on each compute device.

OAP MLlib uses oneDAL as its implementation backend, and oneDAL requires enough native memory to be allocated for each executor. For large datasets, depending on the algorithm, you may need to tune spark.executor.memoryOverhead to allocate enough native memory; setting it to a value larger than dataset size / number of executors is a good starting point.

OAP MLlib expects each executor to act as exactly one oneCCL rank for compute. Because the spark.shuffle.reduceLocality.enabled option is true by default, when the dataset is not evenly distributed across executors this option may assign more than one rank to a single executor and cause tasks to fail. The error can be fixed by setting spark.shuffle.reduceLocality.enabled to false.
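
As an illustration, the following additions to spark-defaults.conf enable GPU compute and apply the tuning described above; the memory overhead value is only an example and should be sized from your own dataset size and executor count:

# select the compute device, CPU (default) or GPU
spark.oap.mllib.device                    GPU
# native memory for oneDAL, e.g. roughly 8g per executor for a 32 GB dataset on 4 executors
spark.executor.memoryOverhead             8g
# keep exactly one oneCCL rank per executor
spark.shuffle.reduceLocality.enabled      false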

Sanity Check

Setup env.sh

    $ cd conf
    $ cp env.sh.template env.sh

Edit the related variables in the "Minimum Settings" section of env.sh.

Run K-means

    $ cd examples/python/kmeans-pyspark
    $ ./run-cpu.sh

Building Code

Prerequisites

We use Apache Maven to manage and build source code. The following tools and libraries are also needed to build OAP MLlib:

  • JDK 8.0+
  • Apache Maven 3.6.2+
  • GNU GCC 7+
  • Intel® oneAPI Base Toolkit Components:
    • DPC++/C++ Compiler (icpx)
    • Data Analytics Library (oneDAL)
    • Threading Building Blocks (oneTBB)
    • Collective Communications Library (oneCCL)

Generally, you only need to install the Intel® oneAPI Base Toolkit for Linux with all, or a selection of, the components mentioned above. The Intel® oneAPI Base Toolkit can be downloaded and installed from here. Installing oneAPI using package managers (YUM/DNF, APT, and ZYPPER) is also supported. More details about oneAPI can be found here.

Scala and Java dependency descriptions are already included in the Maven POM file.

Note: You can refer to this script to install the correct dependencies.

Build

To clone and checkout source code, run the following commands:

    $ git clone https://github.com/oap-project/oap-mllib.git

Optionally, check out a specific release branch:

    $ cd oap-mllib && git checkout ${version}

We rely on environment variables to find required toolchains and libraries. Please make sure the following environment variables are set for building:

Environment  Description
JAVA_HOME    Path to JDK home directory
DAALROOT     Path to oneDAL home directory
TBB_ROOT     Path to oneTBB home directory
CCL_ROOT     Path to oneCCL home directory

We suggest sourcing the setvars.sh script into the current shell to set up the build environment, as follows:

    $ source /opt/intel/oneapi/setvars.sh
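
Alternatively, as a sketch, the variables can be exported manually; the paths below assume a default /opt/intel/oneapi installation and an illustrative JDK location, so adjust them for your system:

    $ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    $ export DAALROOT=/opt/intel/oneapi/dal/latest
    $ export TBB_ROOT=/opt/intel/oneapi/tbb/latest
    $ export CCL_ROOT=/opt/intel/oneapi/ccl/latest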

You can also refer to this CI script to set up the build environment.

If you prefer to build your own open source oneDAL, oneTBB and oneCCL rather than use the versions included in the oneAPI Base Toolkit, you can refer to the related build instructions and manually source setvars.sh accordingly.

To build, run the following commands:

    $ cd mllib-dal
    $ ../dev/prepare-build-deps.sh
    $ ./build.sh

The built JAR package will be placed in the target directory with the name oap-mllib-x.x.x.jar.

Examples

Python Examples

Example                            Description
kmeans-pyspark                     K-means example for PySpark
pca-pyspark                        PCA example for PySpark
als-pyspark                        ALS example for PySpark
random-forest-classifier-pyspark   Random Forest Classifier example for PySpark
random-forest-regressor-pyspark    Random Forest Regressor example for PySpark
correlation-pyspark                Correlation example for PySpark
summarizer-pyspark                 Summarizer example for PySpark

Scala Examples

Example                    Description
kmeans-scala               K-means example for Scala
pca-scala                  PCA example for Scala
als-scala                  ALS example for Scala
naive-bayes                Naive Bayes example for Scala
linear-regression-scala    Linear Regression example for Scala
correlation-scala          Correlation example for Scala
summarizer-scala           Summarizer example for Scala

Note: Not all examples have both CPU and GPU versions; please check the List of Accelerated Algorithms section.

List of Accelerated Algorithms

Algorithm CPU GPU
K-Means X X
PCA X X
ALS X
Naive Bayes X
Linear Regression X X
Ridge Regression X
Random Forest Classifier X
Random Forest Regressor X
Correlation X X
Summarizer X X

oap-mllib's People

Contributors

adrian-wang, argentea, carsonwang, chenghao-intel, eugene-mark, gfl94, haojinintel, happycherry, hongw2019, ivoson, jerrychenhf, jikunshang, jkself, lee-lei, lidinghao, luciferyang, marin-ma, minmingzhu, moonlit-sailor, rui-mo, shaowenzh, songzhan01, tigersong, xuechendi, xwu99, yao531441, yma11, zhixingheyi-tian, zhouyuan, zhztheplayer


oap-mllib's Issues

[ALS] Use user ID and item ID instead of matrix indices for ALS

The current ALS data input uses row and column indices to address ratings in the rating matrix. In real use cases, rows and columns are usually identified by user IDs and item IDs, represented as integers or strings. For example:

Use string as id

“User1”, “Item1”, 1.0
“User2”, “Item2”, 2.0

Or use integer as id

1234, 4567, 1.0
4321, 5678, 2.0

=> Row / column index for oneDAL

0, 1, 1.0
1, 1, 2.0

Other frameworks such as Spark MLlib ALS handle these string/integer IDs out of the box. We need an extra data-preparation step to map from UserID/ItemID to row/column indices before calling DAL ALS, and to map the results back afterwards.
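
A minimal PySpark sketch of that extra step (the column names and sample rows are hypothetical), using StringIndexer to turn string IDs into numeric indices before training; the fitted indexers keep the label mappings needed to translate results back:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("als-id-mapping").getOrCreate()

    ratings = spark.createDataFrame(
        [("User1", "Item1", 1.0), ("User2", "Item2", 2.0)],
        ["userId", "itemId", "rating"])

    # map string user/item IDs to dense numeric indices
    user_indexer = StringIndexer(inputCol="userId", outputCol="userIndex").fit(ratings)
    item_indexer = StringIndexer(inputCol="itemId", outputCol="itemIndex").fit(ratings)
    indexed = item_indexer.transform(user_indexer.transform(ratings))

    # train on the index columns; the indexer labels map indices back to the original IDs
    als = ALS(userCol="userIndex", itemCol="itemIndex", ratingCol="rating")
    model = als.fit(indexed)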

Also refer to oneapi-src/oneDAL#1514

Reorganize Spark version specific code structure

To serve as a template for profile-based, compile-time source code organization, we need a more standard way to organize the version-specific source code. Here is the proposal:

project-root
--src
----main -> common code entry
----test -> common test code entry
--spark-3.0.1 -> version specific entry
----main -> version specific code entry
----test -> version specific test code entry
--spark-3.1.1
----main
----test
...

We will use build-helper-maven-plugin to add the version-specific source code to the build.
Additionally, as part of this refactor, switch from maven-scala-plugin (old) to scala-maven-plugin (new).

[Release] Error when installing intel-oneapi-dal-devel-2021.1.1 intel-oneapi-tbb-devel-2021.1.1

#  sh dev/install-build-deps-centos.sh
Installing oneAPI components ...
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * base: ftp.sjtu.edu.cn
 * epel: epel.mirror.angkasa.id
 * extras: ftp.sjtu.edu.cn
 * updates: mirror.lzu.edu.cn
oneAPI/signature                                                                                                                                                                                                                                                |  287 B  00:00:00
oneAPI/signature                                                                                                                                                                                                                                                | 1.5 kB  00:00:00 !!!
Resolving Dependencies
--> Running transaction check
---> Package intel-oneapi-dal-devel.x86_64 0:2021.1.1-79 will be installed
---> Package intel-oneapi-dal-devel-2021.1.1.x86_64 0:2021.1.1-79 will be installed
--> Processing Dependency: intel-oneapi-dal-2021.1.1 for package: intel-oneapi-dal-devel-2021.1.1-2021.1.1-79.x86_64
--> Processing Dependency: intel-oneapi-common-licensing for package: intel-oneapi-dal-devel-2021.1.1-2021.1.1-79.x86_64
--> Processing Dependency: intel-oneapi-common-vars for package: intel-oneapi-dal-devel-2021.1.1-2021.1.1-79.x86_64
--> Processing Dependency: intel-oneapi-condaindex for package: intel-oneapi-dal-devel-2021.1.1-2021.1.1-79.x86_64
--> Processing Dependency: intel-oneapi-dal-common-devel-2021.1.1 for package: intel-oneapi-dal-devel-2021.1.1-2021.1.1-79.x86_64
---> Package intel-oneapi-tbb-devel.x86_64 0:2021.1.1-119 will be installed
---> Package intel-oneapi-tbb-devel-2021.1.1.x86_64 0:2021.1.1-119 will be installed
--> Processing Dependency: intel-oneapi-tbb-common-devel-2021.1.1 for package: intel-oneapi-tbb-devel-2021.1.1-2021.1.1-119.x86_64
--> Processing Dependency: intel-oneapi-tbb-2021.1.1 for package: intel-oneapi-tbb-devel-2021.1.1-2021.1.1-119.x86_64
--> Running transaction check
---> Package intel-oneapi-common-licensing-2021.1.1.noarch 0:2021.1.1-60 will be installed
---> Package intel-oneapi-common-vars.noarch 0:2021.2.0-195 will be installed
---> Package intel-oneapi-condaindex.x86_64 0:2021.2.0-94 will be installed
---> Package intel-oneapi-dal-2021.1.1.x86_64 0:2021.1.1-79 will be installed
--> Processing Dependency: intel-oneapi-compiler-dpcpp-cpp-runtime for package: intel-oneapi-dal-2021.1.1-2021.1.1-79.x86_64
--> Processing Dependency: intel-oneapi-dal-common-2021.1.1 for package: intel-oneapi-dal-2021.1.1-2021.1.1-79.x86_64
---> Package intel-oneapi-dal-common-devel-2021.1.1.noarch 0:2021.1.1-79 will be installed
---> Package intel-oneapi-tbb-2021.1.1.x86_64 0:2021.1.1-119 will be installed
--> Processing Dependency: intel-oneapi-tbb-common-2021.1.1 for package: intel-oneapi-tbb-2021.1.1-2021.1.1-119.x86_64
---> Package intel-oneapi-tbb-common-devel-2021.1.1.noarch 0:2021.1.1-119 will be installed
--> Running transaction check
---> Package intel-oneapi-compiler-dpcpp-cpp-runtime.x86_64 0:2021.2.0-610 will be installed
--> Processing Dependency: intel-oneapi-compiler-dpcpp-cpp-runtime-2021.2.0 for package: intel-oneapi-compiler-dpcpp-cpp-runtime-2021.2.0-610.x86_64
---> Package intel-oneapi-dal-common-2021.1.1.noarch 0:2021.1.1-79 will be installed
---> Package intel-oneapi-tbb-common-2021.1.1.noarch 0:2021.1.1-119 will be installed
--> Running transaction check
---> Package intel-oneapi-compiler-dpcpp-cpp-runtime-2021.2.0.x86_64 0:2021.2.0-610 will be installed
--> Processing Dependency: intel-oneapi-compiler-shared-runtime-2021.2.0 for package: intel-oneapi-compiler-dpcpp-cpp-runtime-2021.2.0-2021.2.0-610.x86_64
--> Processing Dependency: intel-oneapi-common-licensing-2021.2.0 for package: intel-oneapi-compiler-dpcpp-cpp-runtime-2021.2.0-2021.2.0-610.x86_64
--> Processing Dependency: intel-oneapi-tbb-2021.2.0 for package: intel-oneapi-compiler-dpcpp-cpp-runtime-2021.2.0-2021.2.0-610.x86_64
--> Running transaction check
---> Package intel-oneapi-common-licensing-2021.2.0.noarch 0:2021.2.0-195 will be installed
---> Package intel-oneapi-compiler-shared-runtime-2021.2.0.x86_64 0:2021.2.0-610 will be installed
--> Processing Dependency: intel-oneapi-compiler-shared-common-runtime-2021.2.0 for package: intel-oneapi-compiler-shared-runtime-2021.2.0-2021.2.0-610.x86_64
--> Processing Dependency: intel-oneapi-openmp-2021.2.0 for package: intel-oneapi-compiler-shared-runtime-2021.2.0-2021.2.0-610.x86_64
---> Package intel-oneapi-tbb-2021.2.0.x86_64 0:2021.2.0-357 will be installed
--> Processing Dependency: intel-oneapi-tbb-common-2021.2.0 for package: intel-oneapi-tbb-2021.2.0-2021.2.0-357.x86_64
--> Running transaction check
---> Package intel-oneapi-compiler-shared-common-runtime-2021.2.0.noarch 0:2021.2.0-610 will be installed
---> Package intel-oneapi-openmp-2021.2.0.x86_64 0:2021.2.0-610 will be installed
--> Processing Dependency: intel-oneapi-openmp-common-2021.2.0 for package: intel-oneapi-openmp-2021.2.0-2021.2.0-610.x86_64
---> Package intel-oneapi-tbb-common-2021.2.0.noarch 0:2021.2.0-357 will be installed
--> Running transaction check
---> Package intel-oneapi-openmp-common-2021.2.0.noarch 0:2021.2.0-610 will be installed
--> Processing Conflict: intel-oneapi-compiler-shared-common-runtime-2021.2.0-2021.2.0-610.noarch conflicts intel-oneapi-common-licensing < 2021.2.0
--> Processing Conflict: intel-oneapi-openmp-common-2021.2.0-2021.2.0-610.noarch conflicts intel-oneapi-common-licensing < 2021.2.0
--> Processing Conflict: intel-oneapi-compiler-dpcpp-cpp-runtime-2021.2.0-2021.2.0-610.x86_64 conflicts intel-oneapi-common-licensing < 2021.2.0
--> Processing Conflict: intel-oneapi-compiler-dpcpp-cpp-runtime-2021.2.0-2021.2.0-610.x86_64 conflicts intel-oneapi-tbb < 2021.2.0
--> Processing Conflict: intel-oneapi-tbb-2021.2.0-2021.2.0-357.x86_64 conflicts intel-oneapi-common-licensing < 2021.2.0
--> Processing Conflict: intel-oneapi-tbb-2021.2.0-2021.2.0-357.x86_64 conflicts intel-oneapi-tbb-common < 2021.2.0
--> Processing Conflict: intel-oneapi-openmp-2021.2.0-2021.2.0-610.x86_64 conflicts intel-oneapi-common-licensing < 2021.2.0
--> Processing Conflict: intel-oneapi-common-vars-2021.2.0-195.noarch conflicts intel-oneapi-common-licensing < 2021.2.0
--> Processing Conflict: intel-oneapi-compiler-shared-runtime-2021.2.0-2021.2.0-610.x86_64 conflicts intel-oneapi-common-licensing < 2021.2.0
--> Processing Conflict: intel-oneapi-tbb-common-2021.2.0-2021.2.0-357.noarch conflicts intel-oneapi-common-licensing < 2021.2.0
--> Finished Dependency Resolution
Error: intel-oneapi-compiler-shared-common-runtime-2021.2.0 conflicts with intel-oneapi-common-licensing-2021.1.1-2021.1.1-60.noarch
Error: intel-oneapi-openmp-common-2021.2.0 conflicts with intel-oneapi-common-licensing-2021.1.1-2021.1.1-60.noarch
Error: intel-oneapi-compiler-dpcpp-cpp-runtime-2021.2.0 conflicts with intel-oneapi-tbb-2021.1.1-2021.1.1-119.x86_64
Error: intel-oneapi-common-vars conflicts with intel-oneapi-common-licensing-2021.1.1-2021.1.1-60.noarch
Error: intel-oneapi-openmp-2021.2.0 conflicts with intel-oneapi-common-licensing-2021.1.1-2021.1.1-60.noarch
Error: intel-oneapi-compiler-dpcpp-cpp-runtime-2021.2.0 conflicts with intel-oneapi-common-licensing-2021.1.1-2021.1.1-60.noarch
Error: intel-oneapi-compiler-shared-runtime-2021.2.0 conflicts with intel-oneapi-common-licensing-2021.1.1-2021.1.1-60.noarch
Error: intel-oneapi-tbb-2021.2.0 conflicts with intel-oneapi-tbb-common-2021.1.1-2021.1.1-119.noarch
Error: intel-oneapi-tbb-2021.2.0 conflicts with intel-oneapi-common-licensing-2021.1.1-2021.1.1-60.noarch
Error: intel-oneapi-tbb-common-2021.2.0 conflicts with intel-oneapi-common-licensing-2021.1.1-2021.1.1-60.noarch
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

Cannot compile Intel-mllib of master branch.

We tried to compile Intel-MLlib from the master branch and hit an error like the following:
[screenshot of the build error]
The Maven command was "mvn clean package -DskipTests -Pspark-3.1.1", and we have successfully installed oneAPI and oneCCL.

[Release] Meet hang issue when running PCA algorithm.

In our automated tests for the OAP product, we found that running the PCA and K-means algorithms with Intel-MLlib often hangs, which blocks the whole workflow. The phenomenon is shown in the screenshot below:
[screenshot of the hang]

[Optimization] Use ccl::all2all to simulate gather

Problem: Currently we use ccl::allgather as a gather. Network bandwidth is wasted on unnecessary transfers to worker ranks.
Solution: Implement gather using ccl::all2all so that data is transferred only to the root rank, avoiding the unnecessary data transfer.

[PIP] Misc improvements and refactor code

CI:

  • Fix CI test bugs

Code Style:

  • Fix lint_scala bug
  • Apply styles for Java, Scala and C++
  • Add license headers
  • Rename all fit functions to train

Kmeans:

  • Improve data conversion and cache & unpersist
  • Refactor code for each profile

ALS:

  • Remove redundant code

OneCCL:

  • Improve error message output
