MLSM - Multiple Lead Score Models

Red Hat's implementation of multiple data science models against input data sets.

Introduction

Operationalizing outputs from data science efforts is a tricky art. With so much data science relying on cutting-edge techniques, it can be difficult to balance that with the need for a stable infrastructure. Running a static analysis to provide either a spreadsheet or slide deck is fairly straightforward, but also prone to human error, time constraints, and limited resources. If you're trying to make real-time decisions based on data science models, the engine for processing models has to be reliable.

Our original use case for this package was deploying lead scoring models: given the information we know about a person (based on marketing data), how should they be prioritized when being passed over to sales? If we implement a model for scoring leads and the infrastructure breaks (due to deploying a new model, uncaught exceptions, etc.), there are immediate downstream impacts on other groups who depend on this information.

Challenges with other platforms:

  • Limited modeling options; usually only simple "if-then" statements
  • Slow processing times
  • Lack of flexibility to change model types quickly
  • Lack of version control

To address these challenges, we built the mlsm package.

Using a Python framework offers the following solutions:

  • Limited modeling options; usually only simple "if-then" statements
    • Python offers direct access to data science standards, such as linear regression, random forests, and machine learning
  • Slow processing times
    • By running Python on a server, we can add more resources to speed up processing as needed
  • Lack of flexibility to quickly deploy new models
    • With a standardized architecture designed for multiple concurrent models, deploying new models is easy
  • Lack of version control
    • By deploying with a Git-based system, we can easily view changes and roll back to older code; the architecture for multiple models also allows easier comparison of outputs against live data

Architecture

Model class

Model is a high-level wrapper for functions that run models. This not only puts functions in a framework so they can be applied more consistently, it also facilitates easier storage of results by model name/version, better validation of input data, and simpler scripts for applying models.

The inputs for executing a function, data and results, are expected to be a single record (the function RunModelsAll is used to apply models across multiple records).

Model also collects a fields dict, used to validate input values before execution (future enhancements will include extreme unit/case/exception handling).
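
As a rough illustration of the wrapper idea described above, here is a minimal sketch. It is not the mlsm implementation: the constructor arguments, the validate and execute method names, and the shape of the fields dict (field name mapped to an expected Python type) are all assumptions made for this example.

```python
class Model:
    """Hypothetical stand-in for mlsm's Model; the real class may differ."""

    def __init__(self, name, version, function, fields=None):
        self.name = name
        self.version = version
        self.function = function
        # Assumed shape: field name -> expected Python type.
        self.fields = fields or {}

    def validate(self, values):
        """Return a list of problems found in one record's input values."""
        problems = []
        for field, expected_type in self.fields.items():
            if field not in values:
                problems.append("missing field: " + field)
            elif not isinstance(values[field], expected_type):
                problems.append("wrong type for field: " + field)
        return problems

    def execute(self, data, results):
        """Run the wrapped function on a single record and store its output."""
        values = data[self.name][self.version]
        problems = self.validate(values)
        if problems:
            raise ValueError("; ".join(problems))
        output = self.function(values)
        # Results are stored by model name and version.
        results.setdefault(self.name, {})[self.version] = output
        return output


def title_score(values):
    # Toy scoring rule, for illustration only.
    return 10 if values["job_title"] == "CTO" else 1


model = Model("title", "1.0", title_score, fields={"job_title": str})
data = {"title": {"1.0": {"job_title": "CTO"}}}
results = {}
model.execute(data, results)
print(results)  # {'title': {'1.0': 10}}
```

Keeping validation inside the wrapper means a malformed record fails before the scoring function runs, which is one way to avoid uncaught exceptions deep inside model code.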

SummaryModel class

SummaryModel is an extension of Model: it expects the output of previously run Models to be passed (via results) so that it can run a second-level computation. For example, if three Models are run, a SummaryModel could then be used to determine which score 'wins' (that is, which score is passed back to business users).

When passed to RunModelsAll, every SummaryModel will be run after all Models have run.
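
To make the "which score wins" example concrete, the sketch below shows the kind of function a SummaryModel might wrap. The function name highest_score and the assumed layout results[name][version] = numeric score are illustrative choices, not part of the documented mlsm API.

```python
def highest_score(results):
    """Pick the winning model output from earlier results.

    Assumes results is laid out as results[name][version] = numeric score.
    """
    candidates = [
        (score, name, version)
        for name, versions in results.items()
        for version, score in versions.items()
    ]
    # Tuples compare by their first element, so max() selects the top score.
    score, name, version = max(candidates)
    return {"model": name, "version": version, "score": score}


# Outputs of three previously run models (made-up values).
results = {
    "title": {"1.0": 10},
    "activity": {"2.1": 42},
    "company": {"1.3": 7},
}
winner = highest_score(results)
print(winner)  # {'model': 'activity', 'version': '2.1', 'score': 42}
```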

Data Flow

Pre-run

  • An ETL script prepares data for processing (not part of the mlsm package)
    • Each record is a dict
    • Each record has a unique identifier (which can later be passed as a parameter to RunModelsAll)
    • Data for each record is structured in sub-dictionaries
      • data top-level dict
      • Model.name sub-level
      • Model.version sub-level - data values contained here
    • When running a Model, the data passed to the underlying function will be taken from the path data[Model.name][Model.version]
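
The nesting described above might look like the following for a single record. The key names "id" and "data", the model names, and the field values are all made up for illustration; only the data[Model.name][Model.version] path comes from the description.

```python
# One record as an ETL script might prepare it (hypothetical keys and values).
record = {
    "id": "lead-001",  # unique identifier for the record
    "data": {
        # Model.name sub-level
        "title": {
            # Model.version sub-level: the data values live here
            "1.0": {"job_title": "CTO"},
        },
        "activity": {
            "2.1": {"page_views": 12, "downloads": 3},
        },
    },
}


def inputs_for(record, name, version):
    """Return the values a model with this name and version would receive."""
    return record["data"][name][version]


print(inputs_for(record, "activity", "2.1"))
# {'page_views': 12, 'downloads': 3}
```

Because inputs are keyed by both name and version, two versions of the same model can receive different inputs from the same record.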

RunModelsAll

  • For each record passed (records):
    • For each Model passed (models):
      • Execute model
    • For each SummaryModel passed (summaryModels):
      • Execute summary model
    • If DB parameters passed:
      • Store output results object in MongoDB
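
The loop above can be sketched as follows. This is a simplified stand-in, not the package's RunModelsAll: models are represented as plain (name, version, function) tuples, the function name run_models_all and its parameters are assumptions, and a generic store callback takes the place of the MongoDB write.

```python
def run_models_all(records, models, summary_models, store=None, id_field="id"):
    """Apply every model, then every summary model, to each record."""
    all_results = {}
    for record in records:
        results = {}
        # First pass: each model reads its own slice of the record's data.
        for name, version, function in models:
            values = record["data"][name][version]
            output = function(values)
            results.setdefault(name, {})[version] = output
        # Second pass: summary models see the results gathered so far.
        for name, version, function in summary_models:
            output = function(results)
            results.setdefault(name, {})[version] = output
        # Optional persistence step (stands in for the MongoDB write).
        if store is not None:
            store({"id": record[id_field], "results": results})
        all_results[record[id_field]] = results
    return all_results


def title_score(values):
    return 10 if values["job_title"] == "CTO" else 1


def activity_score(values):
    return values["page_views"] * 2


def total(results):
    return sum(s for versions in results.values() for s in versions.values())


records = [
    {"id": "lead-001", "data": {"title": {"1.0": {"job_title": "CTO"}},
                                "activity": {"2.1": {"page_views": 12}}}},
    {"id": "lead-002", "data": {"title": {"1.0": {"job_title": "Analyst"}},
                                "activity": {"2.1": {"page_views": 3}}}},
]
models = [("title", "1.0", title_score), ("activity", "2.1", activity_score)]
summary_models = [("total", "1.0", total)]

stored = []
out = run_models_all(records, models, summary_models, store=stored.append)
print(out["lead-001"]["total"]["1.0"])  # 34
```

Passing the persistence step in as a callback keeps the loop testable without a database; in a real deployment the callback could wrap an insert into a MongoDB collection.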

Setup Process

Hosting

We use Red Hat's OpenShift (Python 3.3 cartridge), with an additional install script to run Anaconda for more advanced models.

Creating and deploying models

We follow this procedure for creating and deploying models. It also serves as the workflow for passing models from data scientists to engineers for deployment:

Data Scientists

  1. Create a new local git repo (and a remote for backup)
  2. Use git-flow for develop/release/feature management
  3. When a new model is ready to be deployed, push to the remote and inform the engineering team of the commit ID
  • If testing a model from the develop branch, Model.status must be set to 'draft'

Engineers

  1. Pull model repo to local
  2. Add to deployment code repo
  3. Import model to script which runs RunModelsAll
  4. Add any additional requirements to existing ETL scripts

Future

A standard implementation leveraging Jenkins, so that data scientists can have their latest models pulled in directly from the remote.
