databricks-r-env's Introduction

R & RStudio Environment Setup on Databricks

Artifacts to assist with establishing R/RStudio workloads on Databricks.

Init Script

In /init-scripts there is currently one notebook which configures & installs:

  • Simba Spark ODBC driver (2.6.19)
  • {mlflow} and {odbc} (MRAN snapshot 2022-02-24)
  • ODBC data sources:
    • databricks-self: Existing cluster (self)
    • databricks: Any databricks endpoint/cluster
  • RStudio Connection Snippets

RStudio is not installed as part of the init script because it is pre-installed with the ML variant of the Databricks Runtime (DBR). It is recommended that you use the ML runtime (preferably LTS) in order to reduce cluster start times.

To ensure ODBC connections work seamlessly, it's recommended to update the variables at the start of the init script, which look like this:

# SET VARIABLES
WORKSPACE_ID=<Workspace ID>
WORKSPACE_URL=<Workspace URL>
MRAN_SNAPSHOT=<MRAN Snapshot Date>

This should look something like...

# SET VARIABLES
WORKSPACE_ID=123123123123123
WORKSPACE_URL=XXXXXXXXXX.cloud.databricks.com
MRAN_SNAPSHOT=2022-02-24

WORKSPACE_ID can be derived via the workspace URL (after ?o=) or by asking your Databricks account admin. MRAN_SNAPSHOT is found via DBR release notes, see below.
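
As a minimal sketch of reading the workspace ID from a URL, the helper below pulls out the digits that follow `o=`. The function name `workspace_id_from_url` and the sample URL are illustrative assumptions, not part of the init script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the workspace ID from a Databricks URL
# that carries it as the "o" query parameter (for example ...?o=<id>#...).
workspace_id_from_url() {
  local url="$1"
  # Print only the digits that follow "?o=" or "&o="; print nothing if absent.
  printf '%s\n' "$url" | sed -n 's/.*[?&]o=\([0-9][0-9]*\).*/\1/p'
}

# Made-up URL using the placeholder values from the example above.
workspace_id_from_url "https://XXXXXXXXXX.cloud.databricks.com/?o=123123123123123#notebook/1"
```

If the URL has no `o=` parameter the helper prints nothing, in which case the ID has to come from your Databricks account admin.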

Cluster Policies

In /cluster-policies there are:

  • rstudio-generic.json:

    • DBR 10.4 ML LTS (10.4.x-cpu-ml-scala2.12) (forced)
    • Auto-termination disabled (forced)
    • Set purpose tag to rstudio (forced)
    • Set init_scripts to include init script dbfs:/databricks/init/r-env-init-aws.sh (forced)
    • Policy applies only to all-purpose clusters; it will not work for job clusters
  • rstudio-single-node.json:

    • Extends rstudio-generic.json as baseline
    • Sets cluster to SingleNode mode

For further information on configuring cluster policies see the docs.
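
For orientation, here is a hedged sketch of how the rules listed for rstudio-generic.json could be written as a cluster policy definition. It is not the contents of the file in /cluster-policies: the attribute paths follow the general Databricks policy definition format, and the exact fields the repository uses are an assumption. The script only writes the sketch to a temporary file.

```shell
#!/usr/bin/env bash
# Hypothetical policy sketch; NOT a copy of rstudio-generic.json.
# Each "fixed" entry forces a value that users of the policy cannot change.
policy_file="$(mktemp)"
cat > "$policy_file" <<'EOF'
{
  "spark_version": { "type": "fixed", "value": "10.4.x-cpu-ml-scala2.12" },
  "autotermination_minutes": { "type": "fixed", "value": 0 },
  "custom_tags.purpose": { "type": "fixed", "value": "rstudio" },
  "init_scripts.0.dbfs.destination": {
    "type": "fixed",
    "value": "dbfs:/databricks/init/r-env-init-aws.sh"
  },
  "cluster_type": { "type": "fixed", "value": "all-purpose" }
}
EOF
echo "wrote policy sketch to $policy_file"
```

Compare any such sketch against the actual JSON in the repository before creating a policy from it.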

Connecting to ODBC/JDBC

ODBC

The init script will configure two ODBC data sources:

  1. databricks-self: Existing cluster (self)
  2. databricks: Any databricks endpoint/cluster

These will be available within the RStudio connections pane with preconfigured code snippets.

Connection Examples

PWD is expected to be a Databricks Personal Access Token.
HTTPPath is provided in the cluster/endpoint UI under ODBC settings (docs).

# connecting via ODBC to a SQL Endpoint
library(DBI)
conn <- dbConnect(
  odbc::odbc(),
  dsn      = "databricks",
  HTTPPath = "/sql/1.0/endpoints/XXXXXXXXXX",
  PWD      = "dapiXXXXXXXXXXXXX"
)

# connecting via ODBC to the same cluster that RStudio is running on
library(DBI)
conn <- dbConnect(
  odbc::odbc(),
  dsn = "databricks-self",
  PWD = "dapiXXXXXXXXXXXXX"
)

It's recommended not to store tokens or passwords in plain text. Databricks recommends the use of secret scopes, which can be set and accessed through Spark configs on the Databricks cluster (docs).

This would enable the following:

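library(DBI)
# set `spark.<property-name> {{secrets/<scope-name>/<secret-name>}}` on cluster
# attach to the cluster's Spark session so the config can be read
SparkR::sparkR.session()
conn <- dbConnect(
  odbc::odbc(),
  dsn = "databricks-self",
  # sparkR.conf() returns a named list, so extract the single value
  PWD = SparkR::sparkR.conf("spark.<property-name>")[[1]]
)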

Advanced

Updating Simba Drivers

Download URL and instructions for using Simba drivers:

To get the URL, use 'Copy Link Address' on the download button. The copied URL can then replace line 18 in the init script.

The way the driver structures its contents may differ between versions, which would affect the ODBC configuration.

Therefore the Driver path in the snippet below may require updating.

[databricks-self]
Driver = /opt/simba/spark/lib/64/libsparkodbc_sb64.so
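
One way to catch a stale Driver path after changing driver versions is to read the path back from the ODBC configuration and check that the file exists. The sketch below assumes an odbc.ini-style file; the function names `driver_for_dsn` and `check_driver`, and the /etc/odbc.ini location in the usage comment, are illustrative assumptions rather than something the init script provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Driver path declared for one DSN section
# of an odbc.ini-style file.
driver_for_dsn() {
  local ini="$1" dsn="$2"
  awk -v section="[$dsn]" '
    /^\[/ { in_section = ($0 == section) }   # track which [section] we are in
    in_section && /^Driver[ \t]*=/ {
      sub(/^Driver[ \t]*=[ \t]*/, "")        # strip the "Driver = " prefix
      print
      exit
    }
  ' "$ini"
}

# Hypothetical helper: succeed only if the DSN declares a driver file that exists.
check_driver() {
  local path
  path="$(driver_for_dsn "$1" "$2")"
  [ -n "$path" ] && [ -f "$path" ]
}

# Example usage (the file location is an assumption):
# check_driver /etc/odbc.ini databricks-self || echo "Driver path needs updating"
```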

Installing Packages Using MRAN Snapshot

It's recommended to use the same MRAN snapshot as the Databricks Runtime being used. The snapshot date is disclosed in the DBR release notes (example).

It's also possible to use the MRAN time machine to choose a desired snapshot.
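
A minimal sketch of turning a snapshot date into a repository URL follows. It assumes the MRAN URL form https://mran.microsoft.com/snapshot/<date>. The helper name `mran_repo_url` is hypothetical, and the install command is only printed, not run, because how the init script actually installs packages is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the MRAN repository URL for a snapshot date.
mran_repo_url() {
  local snapshot="$1"
  # Reject anything that is not shaped like YYYY-MM-DD.
  if ! printf '%s' "$snapshot" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'; then
    echo "invalid snapshot date: $snapshot" >&2
    return 1
  fi
  printf 'https://mran.microsoft.com/snapshot/%s\n' "$snapshot"
}

MRAN_SNAPSHOT=2022-02-24
repo="$(mran_repo_url "$MRAN_SNAPSHOT")"
# Print the install command rather than executing it, so the sketch has no side effects.
echo "Rscript -e 'install.packages(c(\"mlflow\", \"odbc\"), repos = \"$repo\")'"
```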

Configuring mlflow

Add these variables to /etc/R/Renviron.site:

  • MLFLOW_PYTHON_BIN="/databricks/python/bin/python3"
  • MLFLOW_BIN="/databricks/python3/bin/mlflow"
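
If these lines are added from a script that may run more than once, appending them idempotently avoids duplicate entries. The sketch below is one way to do that. The `append_once` helper and the `RENVIRON` variable are illustrative assumptions, and the target defaults to a scratch file so the example can be run safely.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a line to a file only if that exact line
# is not already present.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Defaults to a scratch file; set RENVIRON=/etc/R/Renviron.site on a cluster.
RENVIRON="${RENVIRON:-$(mktemp)}"
append_once 'MLFLOW_PYTHON_BIN="/databricks/python/bin/python3"' "$RENVIRON"
append_once 'MLFLOW_BIN="/databricks/python3/bin/mlflow"' "$RENVIRON"
```

Calling `append_once` again with the same line leaves the file unchanged.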

Configuring {reticulate}

  • /etc/R/Renviron.site needs to be configured with the RETICULATE_PYTHON variable.
    • It is set to /databricks/python3/bin/python3 and can be changed as necessary.
  • /usr/lib/R/etc/Renviron.site is adjusted to extend PATH with the following: PATH=${PATH}:/databricks/conda/bin

Configuring RStudio

Although the ML runtime includes RStudio, there may be cases where a different version is required, or where Server Pro/Workbench is preferred. Documentation for these processes is found here.

Contributors

  • zacdav-db