azure-databricks-recommendation

Introduction

The following is a movie recommendation system data pipeline implemented in Azure Databricks. This solution demonstrates Databricks as a Unified Analytics Platform through an end-to-end data pipeline that includes:

  1. Initial ETL data loading process
  2. Ingesting subsequent data through Spark Structured Streaming
  3. Model training and scoring
  4. Persisting the trained model
  5. Productionizing the model through batch scoring jobs
  6. User dashboards

Architecture

Movie ratings data is generated by a simple .NET Core application running in an Azure Container Instance, which sends the data to an Azure Event Hub. The ratings are then consumed and processed by a Spark Structured Streaming (Scala) job within Azure Databricks. The recommendation system uses a collaborative filtering model, specifically the Alternating Least Squares (ALS) algorithm implemented in Spark ML and PySpark (Python). The solution also contains two scheduled jobs that demonstrate how one might productionize the fitted model: the first creates daily top 10 movie recommendations for all users, while the second retrains the model with the newly received ratings data. The solution also demonstrates Spark's model persistence, in which a model can be loaded in a different language (Scala) from the one it was saved in (Python). Finally, the data is visualized with a parameterized notebook / dashboard using Databricks Widgets.
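To make the collaborative filtering idea concrete, here is a minimal pure-Python sketch of alternating least squares followed by a top-N lookup. It is not the Spark ML implementation used in the notebooks: it assumes a single latent factor per user and movie, and the function names (als_rank1, top_n), the regularization value, and the toy ratings are all made up for illustration.

```python
from collections import defaultdict
import heapq


def als_rank1(ratings, n_iters=20, reg=0.1):
    """Fit one latent factor per user and per movie by alternating least squares.

    ratings: iterable of (user_id, movie_id, rating) tuples.
    Returns (user_factors, movie_factors) as dicts.
    """
    by_user, by_movie = defaultdict(list), defaultdict(list)
    for user, movie, rating in ratings:
        by_user[user].append((movie, rating))
        by_movie[movie].append((user, rating))

    users = {user: 1.0 for user in by_user}
    movies = {movie: 1.0 for movie in by_movie}
    for _ in range(n_iters):
        # Hold movie factors fixed; each user factor is a 1-D ridge regression.
        for user, rated in by_user.items():
            num = sum(r * movies[m] for m, r in rated)
            den = sum(movies[m] ** 2 for m, _ in rated) + reg
            users[user] = num / den
        # Hold user factors fixed; update each movie factor the same way.
        for movie, raters in by_movie.items():
            num = sum(r * users[u] for u, r in raters)
            den = sum(users[u] ** 2 for u, _ in raters) + reg
            movies[movie] = num / den
    return users, movies


def top_n(user, users, movies, seen, n=10):
    """Return up to n unseen movies with the highest predicted rating for a user."""
    candidates = (m for m in movies if m not in seen)
    return heapq.nlargest(n, candidates, key=lambda m: users[user] * movies[m])


# Toy ratings (hypothetical, not MovieLens data).
ratings = [
    ("alice", "m1", 1.0), ("alice", "m2", 2.0),
    ("bob", "m1", 2.0), ("bob", "m2", 4.0), ("bob", "m3", 5.0),
]
users, movies = als_rank1(ratings)
print(top_n("alice", users, movies, seen={"m1", "m2"}))
```

Spark ML's ALS generalizes this to several latent factors and distributes the per-user and per-movie solves across the cluster; the alternating structure is the same.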

DISCLAIMER: Code is not designed for Production and is only for demonstration purposes.

Dashboard

The following shows the Movie Recommendations dashboard by User Id.

To access the dashboard, go to Workspace > recommender_dashboard > 07_user_dashboard, then select View > User Recommendation Dashboard.

Deployment

You can use the following Docker container to deploy the solution:

  • docker run -it devlace/azdatabricksrecommend

Or, alternatively, build and run the container locally with:

  • make deploy_w_docker

For local deployment without Docker

Ensure you are in the root of the repository and logged in to the Azure CLI by running az login.

Requirements

Development environment

  • The following works with Windows Subsystem for Linux.
  • Clone this repository.
  • cd azure-databricks-recommendation
  • `virtualenv .` creates a Python virtual environment to work in.
  • `source bin/activate` activates the virtual environment.
  • `make requirements` installs the Python dependencies in the virtual environment.

Deploy Entire Solution

  • To deploy the solution, run `make deploy` and fill in the prompts.
  • When prompted for a Databricks Host, enter the full name of your Databricks workspace host, e.g. https://southeastasia.azuredatabricks.net
  • When prompted for a token, generate a new token in the Databricks workspace.
  • To view additional make commands, run `make`.
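The "full name" requirement for the Databricks Host prompt can be illustrated with a small check. This is only a sketch: is_full_databricks_host is a hypothetical helper, not part of the Makefile or deploy script, and it assumes a valid host uses the https scheme and ends in .azuredatabricks.net, as in the example above.

```python
from urllib.parse import urlparse


def is_full_databricks_host(host):
    """Return True if host looks like https://<name>.azuredatabricks.net."""
    parsed = urlparse(host.strip())
    return (
        parsed.scheme == "https"
        and parsed.hostname is not None
        and parsed.hostname.endswith(".azuredatabricks.net")
    )


print(is_full_databricks_host("https://southeastasia.azuredatabricks.net"))  # True
print(is_full_databricks_host("southeastasia.azuredatabricks.net"))          # False
```

The second call fails because the scheme is missing, which is the most likely way to mistype the value at the prompt.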

Data

This solution makes use of the MovieLens dataset.*

Project Organization


├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make deploy`
├── README.md          <- The top-level README for developers using this project.
├── data
│   │
│   └── raw            <- The original, immutable data dump.
├── deploy             <- Deployment artifacts
│   │
│   ├── databricks     <- Deployment artifacts in relation to the Databricks workspace
│   │
│   ├── deploy.sh      <- Deployment script to deploy all Azure Resources
│   │
│   ├── azuredeploy.json <- Azure ARM template w/ .parameters file
│   │
│   └── Dockerfile     <- Dockerfile for deployment
│
├── notebooks          <- Azure Databricks Jupyter notebooks. 
│
├── references         <- Contains the powerpoint presentation, and other reference materials.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
└── src                <- Source code for use in this project.
    ├── __init__.py    <- Makes src a Python module
    │
    ├── data           <- Scripts to download or generate data
    │
    └── EventHubGenerator  <- Visual Studio solution EventHub Data Generator (Ratings)

Project based on the cookiecutter data science project template. #cookiecutterdatascience

*F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

Contributors

devlace, xtellurian
