pyspark-me

Databricks client SDK for Python with command line interface for Databricks REST APIs.

Introduction

The pysparkme package provides a Python SDK for the Databricks REST API:

  • dbfs
  • workspace
  • jobs
  • runs

The package also comes with a CLI, which is particularly useful in automation.

Installation

$ pip install pyspark-me

Databricks CLI dbr-me

The dbr-me command line client provides a convenient way to interact with a Databricks cluster from the command line. A common use of this approach is in automation tasks, such as DevOps pipelines or third-party workflow managers.

You can invoke the Databricks CLI using the convenient shell command dbr-me:

$ dbr-me --help

or by running the Python module directly:

$ python -m pysparkme.databricks.cli --help

To connect to the Databricks cluster, you can supply arguments at the command line:

  • --bearer-token
  • --url
  • --cluster-id

Alternatively, you can define environment variables. Command line arguments take precedence.

export DATABRICKS_URL='https://westeurope.azuredatabricks.net/'
export DATABRICKS_BEARER_TOKEN='dapixyz89u9ufsdfd0'
export DATABRICKS_CLUSTER_ID='1234-456778-abc234'
export DATABRICKS_ORG_ID='87287878293983984'

DBFS

List DBFS items

# List items on DBFS
dbr-me dbfs ls --json-indent 3 FileStore/movielens
[
   {
      "path": "/FileStore/movielens/ml-latest-small",
      "is_dir": true,
      "file_size": 0,
      "is_file": false,
      "human_size": "0 B"
   }
]

Download file from DBFS

# Download a file and print to STDOUT
dbr-me dbfs get ml-latest-small/movies.csv

Download directory from DBFS

# Recursively download an entire directory and store it locally
dbr-me dbfs get -o ml-local ml-latest-small

Workspace

The Databricks workspace contains notebooks and other items.

List workspace

####################
# List workspace
# Default path is root - '/'
$ dbr-me workspace ls
# A leading '/' is added automatically
$ dbr-me workspace ls 'Users'
# Space-indented JSON output with the given number of spaces
$ dbr-me workspace --json-indent 4 ls
# Custom indent string
$ dbr-me workspace ls --json-indent='>'

Export items from Databricks workspace

#####################
# Export workspace items
# Export everything in source format using defaults: format=SOURCE, path=/
dbr-me workspace export -o ./.dev/export
# Export everything in DBC format
dbr-me workspace export -f DBC -o ./.dev/export
# When the path is a folder, the export is recursive
dbr-me workspace export -o ./.dev/export-utils 'Utils'
# Export a single item
dbr-me workspace export -o ./.dev/GetML 'Utils/Download MovieLens.py'

Runs

This command group implements the jobs/runs Databricks REST API.

Submit a notebook

Implements: https://docs.databricks.com/dev-tools/api/latest/jobs.html#runs-submit

$ dbr-me runs submit "Utils/Download MovieLens"
{"run_id": 4}

You can retrieve the run information using runs get:

$ dbr-me runs get 4 -i 3

If you need to pass parameters, use the --parameters or -p option and specify JSON text.

$ dbr-me runs submit -p '{"run_tag":"20250103"}' "Utils/Download MovieLens"

You can also reference parameters stored in a JSON file:

$ dbr-me runs submit -p '@params.json' "Utils/Download MovieLens"
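
For example, params.json could contain the same JSON text as above: {"run_tag": "20250103"}.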

You can use the parameters in the notebook, and they will also appear in the run metadata:

dbr-me runs get-output -i 3 8
{
   "notebook_output": {
      "result": "Downloaded files (tag: 20250103): README.txt, links.csv, movies.csv, ratings.csv, tags.csv",
      "truncated": false
   },
   "error": null,
   "metadata": {
      "job_id": 8,
      "run_id": 8,
      "creator_user_name": "[email protected]",
      "number_in_job": 1,
      "original_attempt_run_id": null,
      "state": {
         "life_cycle_state": "TERMINATED",
         "result_state": "SUCCESS",
         "state_message": ""
      },
      "schedule": null,
      "task": {
         "notebook_task": {
            "notebook_path": "/Utils/Download MovieLens",
            "base_parameters": {
               "run_tag": "20250103"
            }
         }
      },
      "cluster_spec": {
         "existing_cluster_id": "xxxx-yyyyyy-zzzzzz"
      },
      "cluster_instance": {
         "cluster_id": "xxxx-yyyyyy-zzzzzzzz",
         "spark_context_id": "8734983498349834"
      },
      "overriding_parameters": null,
      "start_time": 1592067357734,
      "setup_duration": 0,
      "execution_duration": 11000,
      "cleanup_duration": 0,
      "trigger": null,
      "run_name": "pyspark-me-1592067355",
      "run_page_url": "https://westeurope.azuredatabricks.net/?o=89349849834#job/8/run/1",
      "run_type": "SUBMIT_RUN"
   }
}
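
Inside the notebook itself, parameters are read with the dbutils.widgets API that Databricks provides to notebooks. A minimal sketch (the result message format is illustrative):

# Read the 'run_tag' parameter passed via --parameters; it shows up as
# base_parameters in the run metadata above. dbutils is available in
# Databricks notebooks without an import.
run_tag = dbutils.widgets.get('run_tag')

# Exit with a result string; it appears as notebook_output.result
# in `runs get-output`.
dbutils.notebook.exit('Downloaded files (tag: {})'.format(run_tag))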

Get run metadata

Implements: Databricks REST runs/get

$ dbr-me runs get -i 3 6
{
   "job_id": 6,
   "run_id": 6,
   "creator_user_name": "[email protected]",
   "number_in_job": 1,
   "original_attempt_run_id": null,
   "state": {
      "life_cycle_state": "TERMINATED",
      "result_state": "SUCCESS",
      "state_message": ""
   },
   "schedule": null,
   "task": {
      "notebook_task": {
         "notebook_path": "/Utils/Download MovieLens"
      }
   },
   "cluster_spec": {
      "existing_cluster_id": "xxxx-yyyyy-zzzzzz"
   },
   "cluster_instance": {
      "cluster_id": "xxxx-yyyyy-zzzzzz",
      "spark_context_id": "783487348734873873"
   },
   "overriding_parameters": null,
   "start_time": 1592062497162,
   "setup_duration": 0,
   "execution_duration": 11000,
   "cleanup_duration": 0,
   "trigger": null,
   "run_name": "pyspark-me-1592062494",
   "run_page_url": "https://westeurope.azuredatabricks.net/?o=398348734873487#job/6/run/1",
   "run_type": "SUBMIT_RUN"
}

List Runs

Implements: Databricks REST runs/list

$ dbr-me runs ls

To get only the runs for a particular job:

# Get runs for the job with job_id=4
$ dbr-me runs ls 4 -i 3
{
   "runs": [
      {
         "job_id": 4,
         "run_id": 4,
         "creator_user_name": "[email protected]",
         "number_in_job": 1,
         "original_attempt_run_id": null,
         "state": {
            "life_cycle_state": "PENDING",
            "state_message": ""
         },
         "schedule": null,
         "task": {
            "notebook_task": {
               "notebook_path": "/Utils/Download MovieLens"
            }
         },
         "cluster_spec": {
            "existing_cluster_id": "xxxxx-yyyy-zzzzzzz"
         },
         "cluster_instance": {
            "cluster_id": "xxxxx-yyyy-zzzzzzz"
         },
         "overriding_parameters": null,
         "start_time": 1592058826123,
         "setup_duration": 0,
         "execution_duration": 0,
         "cleanup_duration": 0,
         "trigger": null,
         "run_name": "pyspark-me-1592058823",
         "run_page_url": "https://westeurope.azuredatabricks.net/?o=abcdefghasdf#job/4/run/1",
         "run_type": "SUBMIT_RUN"
      }
   ],
   "has_more": false
}

Export run

Implements: Databricks REST runs/export

$ dbr-me runs export --content-only 4 > .dev/run-view.html

Get run output

Implements: Databricks REST runs/get-output

$ dbr-me runs get-output -i 3 6
{
   "notebook_output": {
      "result": "Downloaded files: README.txt, links.csv, movies.csv, ratings.csv, tags.csv",
      "truncated": false
   },
   "error": null,
   "metadata": {
      "job_id": 5,
      "run_id": 5,
      "creator_user_name": "[email protected]",
      "number_in_job": 1,
      "original_attempt_run_id": null,
      "state": {
         "life_cycle_state": "TERMINATED",
         "result_state": "SUCCESS",
         "state_message": ""
      },
      "schedule": null,
      "task": {
         "notebook_task": {
            "notebook_path": "/Utils/Download MovieLens"
         }
      },
      "cluster_spec": {
         "existing_cluster_id": "xxxx-yyyyy-zzzzzzz"
      },
      "cluster_instance": {
         "cluster_id": "xxxx-yyyyy-zzzzzzz",
         "spark_context_id": "8973498743973498"
      },
      "overriding_parameters": null,
      "start_time": 1592062147101,
      "setup_duration": 1000,
      "execution_duration": 11000,
      "cleanup_duration": 0,
      "trigger": null,
      "run_name": "pyspark-me-1592062135",
      "run_page_url": "https://westeurope.azuredatabricks.net/?o=89798374987987#job/5/run/1",
      "run_type": "SUBMIT_RUN"
   }
}

To get only the exit output:

$ dbr-me runs get-output -r 6
Downloaded files: README.txt, links.csv, movies.csv, ratings.csv, tags.csv

Python Client SDK for Databricks REST APIs

To call the Databricks REST APIs from your own Python code, use the client SDK directly.

Create Databricks connection

# Get Databricks workspace connection
dbc = pysparkme.databricks.connect(
        bearer_token='dapixyzabcd09rasdf',
        url='https://westeurope.azuredatabricks.net')
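
If you prefer not to hard-code credentials, you can read the same environment variables the CLI honors. A minimal sketch (the variable names are the ones defined in the CLI section above):

import os
import pysparkme.databricks

# Build an SDK connection from the CLI's environment variables.
dbc = pysparkme.databricks.connect(
        bearer_token=os.environ['DATABRICKS_BEARER_TOKEN'],
        url=os.environ['DATABRICKS_URL'])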

DBFS

# Get list of items at path /FileStore
dbc.dbfs.ls('/FileStore')

# Check if file or directory exists
dbc.dbfs.exists('/path/to/heaven')

# Make a directory and its parents
dbc.dbfs.mkdirs('/path/to/heaven')

# Delete a directory recursively
dbc.dbfs.rm('/path', recursive=True)

# Read a 2048-byte block starting at offset 1024
dbc.dbfs.read('/data/movies.csv', 1024, 2048)

# Download entire file
dbc.dbfs.read_all('/data/movies.csv')
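
These primitives are enough to mirror a DBFS folder locally. A minimal sketch, assuming ls() returns dicts with 'path' and 'is_dir' keys (matching the REST list response shown earlier) and read_all() returns bytes; the helper name download_tree is hypothetical:

import os

def download_tree(dbc, dbfs_path, local_dir):
    """Recursively copy a DBFS directory into a local folder."""
    os.makedirs(local_dir, exist_ok=True)
    for item in dbc.dbfs.ls(dbfs_path):
        # Use the last path segment as the local file or folder name.
        name = item['path'].rsplit('/', 1)[-1]
        target = os.path.join(local_dir, name)
        if item['is_dir']:
            download_tree(dbc, item['path'], target)
        else:
            with open(target, 'wb') as f:
                f.write(dbc.dbfs.read_all(item['path']))

read_all() is convenient for small files; for large files, chunked read() calls with an offset and size may be preferable.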

Databricks workspace

# List root workspace directory
dbc.workspace.ls('/')

# Check if workspace item exists
dbc.workspace.exists('/explore')

# Check if workspace item is a directory
dbc.workspace.is_directory('/')

# Export notebook in default (SOURCE) format
dbc.workspace.export('/my_notebook')

# Export notebook in HTML format
dbc.workspace.export('/my_notebook', 'HTML')
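
To keep a local copy of an exported notebook, write the returned content to disk. A minimal sketch, assuming export() returns the notebook body as bytes (the paths are illustrative):

# Export a notebook in the default SOURCE format and save it locally.
content = dbc.workspace.export('/Utils/Download MovieLens')
with open('Download MovieLens.py', 'wb') as f:
    f.write(content)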

Build and publish

python setup.py sdist bdist_wheel
python -m twine upload dist/*

pyspark-me's Issues

List Databricks runs using the Python API

Synopsis

dbc = pysparkme.databricks.connect()
dbc.jobs.runs.ls()

Method signature

def ls(job_id=None, offset=None, limit=None,
       completed_only=False, active_only=False):
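
A usage sketch based on this proposed signature, assuming the return value mirrors the REST runs/list response shown earlier ('runs' and 'has_more' keys):

# List the 20 most recent completed runs of job 4.
dbc = pysparkme.databricks.connect()
page = dbc.jobs.runs.ls(job_id=4, limit=20, completed_only=True)
for run in page['runs']:
    print(run['run_id'], run['state']['life_cycle_state'])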
