
Welcome to the Iguazio Data Science Platform

Platform Overview

The Iguazio Data Science Platform ("the platform") is a fully integrated and secure data science platform as a service (PaaS), which simplifies development, accelerates performance, facilitates collaboration, and addresses operational challenges. The platform incorporates the following components:

  • A data science workbench that includes Jupyter Notebook, integrated analytics engines, and Python packages
  • Model management with experiments tracking and automated pipeline capabilities
  • Managed data and machine-learning (ML) services over a scalable Kubernetes cluster
  • A real-time serverless functions framework — Nuclio
  • An extremely fast and secure data layer that supports SQL, NoSQL, time-series databases, files (simple objects), and streaming
  • Integration with third-party data sources such as Amazon S3, HDFS, SQL databases, and streaming or messaging protocols
  • Real-time dashboards based on Grafana


Self-service data science platform

The platform uses Kubernetes (k8s) as the baseline cluster manager and deploys various application microservices on top of Kubernetes to address different data science tasks. Most of the provided services support scaling out and GPU acceleration and have secure, low-latency access to the platform's shared data store and file system, enabling high performance and scalability with maximum resource efficiency.

The platform makes extensive use of Nuclio serverless functions to automate various tasks — such as data collection, extract-transform-load (ETL) processes, model serving, and batch jobs. Nuclio functions describe the code and include all the required resource definitions and configuration for running the code. The functions auto-scale and can be versioned. The platform supports various methods for generating Nuclio functions — using the graphical dashboard, Docker, Git, or Jupyter Notebook — as demonstrated in the platform tutorials.
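For illustration, a Nuclio function is ordinary code with a well-known entry point: Nuclio invokes handler(context, event) for every incoming event. The sketch below is a minimal Python handler (the log message and return value are arbitrary); the tutorials show how to attach the resource definitions, triggers, and configuration around such code.

# A minimal Nuclio handler; Nuclio calls handler(context, event) for each event.
def handler(context, event):
    # The context provides logging and other runtime utilities; the event carries the trigger payload.
    context.logger.info('Handling an event')
    return 'hello from a Nuclio function'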

For a more in-depth introduction to the platform, see the platform's product documentation.

A good place to start your development is with the platform tutorial Jupyter notebooks, which are available in the home directory of the platform's Jupyter Notebook service; see especially the getting-started examples and full use-case demo applications. You can find a tutorials overview in the Jupyter Notebook Basics section of this document.

Data Science Workflow

The Iguazio Data Science Platform provides a complete data science workflow in a single ready-to-use platform that includes all the required building blocks for creating data science applications from research to production:

  • Collect, explore, and label data from various real-time or offline sources
  • Run ML training and validation at scale over multiple CPUs and GPUs
  • Deploy models and applications into production with serverless functions
  • Log, monitor, and visualize all your data and services



Collecting and Ingesting Data

There are many ways to collect and ingest data from various sources into the platform:

  • Streaming data in real time from sources such as Kafka, Kinesis, Azure Event Hubs, or Google Pub/Sub.
  • Loading data directly from external databases using an event-driven or periodic/scheduled implementation. See the explanation and examples in the read-external-db tutorial.
  • Loading files (objects), in any format (for example, CSV, Parquet, JSON, or a binary image), from internal or external sources such as Amazon S3 or Hadoop. See, for example, the file-access tutorial.
  • Importing time-series telemetry data using a Prometheus-compatible scraping API.
  • Ingesting (writing) data directly into the system using RESTful AWS-like simple-object, streaming, or NoSQL APIs. See the platform's Web-API References.
  • Scraping or reading data from external sources — such as Twitter, weather services, or stock-trading data services — using serverless functions. See, for example, the stocks demo use-case application.

For more information and examples of data collection and ingestion with the platform, see the getting-started-basic tutorial Jupyter notebook.
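As a minimal illustration of file-based ingestion, the following sketch loads a CSV object from an external source with pandas and writes it to the platform's shared file system; the URL and target path are hypothetical placeholders.

import pandas as pd

# Read an object from an external source (hypothetical public URL).
df = pd.read_csv('https://s3.amazonaws.com/my-bucket/my-data.csv')

# Persist the data to the platform's shared file system, under the running user's home directory.
df.to_csv('/User/examples/my-data.csv', index=False)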

Exploring and Processing Data

The platform includes a wide range of integrated open-source data query and exploration tools, including the following:

  • Apache Spark data-processing engine — including the Spark SQL and Datasets, MLlib, R, and GraphX libraries — with real-time access to the platform's NoSQL data store and file system. See the platform's Spark APIs reference and the examples in the spark-sql-analytics tutorial.
  • Presto distributed SQL query engine, which can be used to run interactive SQL queries over platform NoSQL tables or other object (file) data sources. See the platform's Presto reference.
  • pandas Python analysis library, including structured DataFrames.
  • Dask parallel-computing Python library, including scaled pandas DataFrames.
  • V3IO Frames — Iguazio's open-source data-access library, which provides a unified high-performance API for accessing NoSQL, stream, and time-series data in the platform's data store and features native integration with pandas and NVIDIA RAPIDS. See, for example, the frames tutorial.
  • Built-in support for ML and scientific-computing packages such as scikit-learn, Matplotlib (Pyplot), NumPy, PyTorch, and TensorFlow.

All these tools are integrated with the platform's Jupyter Notebook service, allowing users to access the same data from Jupyter through different interfaces with minimal configuration overhead. Users can easily install additional Python packages by using the Conda binary package and environment manager and the pip Python package installer, which are both available as part of the Jupyter Notebook service. This design, coupled with the platform's unified data model, enables users to store and access data using different formats — such as NoSQL ("key/value"), time series, stream data, and files (simple objects) — and leverage different tools and APIs for accessing and manipulating the data, all from a single development environment (namely, Jupyter Notebook).
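For example, a notebook can read a platform NoSQL table into a pandas DataFrame through V3IO Frames. The sketch below assumes the Frames service's default in-cluster endpoint and a hypothetical table path; see the frames tutorial for the exact API in your environment.

import v3io_frames as v3f

# Connect to the V3IO Frames service (default in-cluster endpoint; adjust for your deployment).
client = v3f.Client('framesd:8081', container='users')

# Read a (hypothetical) NoSQL table into a pandas DataFrame and explore it with regular pandas calls.
df = client.read(backend='kv', table='admin/examples/bank')
print(df.head())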

Note: You can deploy and manage application services, such as Spark and Jupyter Notebook, from the Services page of the platform dashboard.

For more information and examples of data exploration with the platform, see the collect-n-explore tutorial Jupyter notebook.

Building and Training Models

You can develop and test data science models in the platform's Jupyter Notebook service or in your preferred external editor. When your model is ready, you can train it in Jupyter Notebook or by using scalable cluster resources such as Nuclio functions, Dask, Spark ML, or Kubernetes jobs. You can find model-training examples in the platform's tutorial Jupyter notebooks:

  • The NetOps demo tutorial demonstrates predictive infrastructure monitoring using scikit-learn.
  • The image-classification demo tutorial demonstrates image recognition using TensorFlow and Horovod with MLRun.

If you're a beginner, you might find the guide Machine Learning Algorithms In Layman's Terms useful.
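As a minimal illustration of training a model interactively in Jupyter Notebook, the following sketch uses scikit-learn on its built-in Iris data set; the algorithm and parameters are arbitrary examples.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small example data set and split it into training and validation sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a simple classifier and evaluate it on the held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print('Validation accuracy:', accuracy_score(y_test, model.predict(X_test)))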

Experiment Tracking

One of the most important and challenging areas of managing a data science environment is the ability to track experiments. Data scientists need a simple way to track and view current and historical experiments along with the metadata that is associated with each experiment. This capability is critical for comparing different runs, and eventually helps to determine the best model and configuration for production deployment. The platform leverages the open-source MLRun library to help tackle these challenges. You can find examples of using MLRun in the MLRun demos.
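For illustration only, the following sketch shows the general MLRun pattern: a handler receives a run context, logs results through it, and MLRun records the run's parameters, results, and metadata. The exact API can differ between MLRun versions; see the MLRun demos for working examples.

import mlrun

# A handler whose parameters and results are tracked by MLRun (the training logic is a placeholder).
def train(context, n_estimators=100):
    accuracy = 0.9  # placeholder for a real training-and-validation step
    context.log_result('accuracy', accuracy)

# Wrap the handler as an MLRun function and run it locally; each run is recorded with its metadata.
fn = mlrun.new_function(name='trainer', kind='local')
run = fn.run(handler=train, params={'n_estimators': 200})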

Deploying Models to Production

The platform allows you to easily deploy your models to production in a reproducible way by using the open-source Nuclio serverless framework. You provide Nuclio with code or Jupyter notebooks, resource definitions (such as CPU, memory, and GPU), environment variables, package or software dependencies, data links, and trigger information. Nuclio uses this information to automatically build the code, generate custom container images, and connect them to the relevant compute or data resources. The functions can be triggered by a wide variety of event sources, including the most commonly used streaming and messaging protocols, HTTP APIs, scheduled (cron) tasks, and batch jobs.
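For illustration, a model-serving function typically loads the model once when the function starts and serves predictions for each request. The sketch below is a hedged example: the model path, request format, and use of a pickled scikit-learn model are assumptions for demonstration, not the platform's prescribed serving API.

import json
import pickle

import numpy as np

# Load the trained model once, when the function's container starts (hypothetical model path).
with open('/User/models/model.pkl', 'rb') as f:
    model = pickle.load(f)

def handler(context, event):
    # Parse the request body (assumed to be JSON with an "instances" list of feature vectors).
    body = json.loads(event.body) if isinstance(event.body, (bytes, str)) else event.body
    features = np.array(body['instances'])
    return {'predictions': model.predict(features).tolist()}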

Nuclio functions can be created from the platform dashboard or by using standard code IDEs, and can be deployed on your platform cluster. A convenient way to develop and deploy Nuclio functions is by using Jupyter Notebook and Python tools. For detailed information about Nuclio, visit the Nuclio web site and see the product documentation.

Note: Nuclio functions aren't limited to model serving: they can automate data collection, serve custom APIs, build real-time feature vectors, drive triggers, and more.

For an overview of Nuclio and how to develop, document, and deploy serverless Python Nuclio functions from Jupyter Notebook, see the nuclio-jupyter documentation. You can also find examples in the platform tutorial Jupyter notebooks; for example, the NetOps demo tutorial demonstrates how to deploy a network-operations model as a function.

Visualization, Monitoring, and Logging

Data in the platform — including collected data, internal or external telemetry and logs, and program-output data — can be analyzed and visualized in different ways simultaneously. The platform supports multiple standard data analytics and visualization tools, including SQL, Prometheus, Grafana, and pandas. For example, you can plot or chart data within Jupyter Notebook using Matplotlib; use your favorite BI visualization tools, such as Tableau, to query data in the platform over a Java Database Connectivity (JDBC) connector; or build real-time dashboards in Grafana.
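For instance, a minimal notebook cell that charts a DataFrame with pandas and Matplotlib could look like the following; the values are placeholders, and in practice the data would be read from a platform table or stream.

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical telemetry data; in practice this would come from a platform table or stream.
df = pd.DataFrame({'cpu': [20, 35, 30, 50, 45]},
                  index=pd.date_range('2020-01-01', periods=5, freq='D'))

# Plot the series directly within the notebook.
df.plot(title='CPU utilization (%)')
plt.show()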

The data analytics and visualization tools and services generate telemetry and log data that can be stored using the platform's time-series database (TSDB) service or by using external tools such as Elasticsearch. Platform users can easily instrument code and functions to collect various statistics or logs, and explore the collected data in real time.

The Grafana open-source analytics and monitoring framework is natively integrated into the platform, allowing users to create dashboards that provide access to platform NoSQL tables and time-series databases from different dashboard widgets. You can also create Grafana dashboards programmatically (for example, from Jupyter Notebook) using wizard scripts. For information on how to create Grafana dashboards to monitor and visualize data in the platform, see Adding a Custom Grafana Dashboard.

End-to-End Use-Case Applications

Iguazio provides full end-to-end use-case applications (demos) that demonstrate how to use the Iguazio Data Science Platform and related tools to address data science requirements for different industries and implementations. Some of the demos are pre-deployed with the platform and available in the demos tutorial-notebooks directory.
You can get additional demos from the MLRun demos repository by running the following code.

Note: Some of the MLRun demos are still works in progress.

# Get MLRun demos
!chmod +x /User/get-demos.sh
!/User/get-demos.sh

The downloaded demos include the following applications; for more details, see demos/README-MLRUN.md (which is created as part of the download):

  • XGBoost classification (xgboost) — uses XGBoost to perform binary classification on the Iris data set (a popular machine-learning use case), and runs parallel model training over multiple hyperparameter combinations.
  • LightGBM classification (lightgbm) — uses LightGBM to perform binary classification on the HIGGS data set (a popular machine-learning competition use case), and runs parallel model training over multiple hyperparameter combinations.
  • Face recognition (faces) — implements real-time capture of face images, image recognition, and location tracking of identities.
  • Serverless Spark (spark) — demonstrates how to run the same Spark job locally and as a distributed MLRun job over Kubernetes. The Spark function can be incorporated as a step in various data-preparation and machine-learning scenarios.
  • Image recognition (image_classification) — builds and trains an ML model that identifies (recognizes) and classifies images by using Keras, TensorFlow, and scikit-learn.
  • Predictive infrastructure monitoring (netops) — builds, trains, and deploys a machine-learning model for analyzing and predicting failure in network devices as part of a network operations (NetOps) flow. The goal is to identify anomalies for device metrics — such as CPU, memory consumption, or temperature — which can signify an upcoming issue or failure.

The pre-deployed demos include the following use-case applications; for more details, see demos/README.md (also available as a notebook):

  • Natural language processing (NLP) (nlp) — processes natural-language textual data — including spelling correction and sentiment analysis — and generates a Nuclio serverless function that translates any given text string to another (configurable) language.
  • Stream enrichment (stream-enrich) — implements a typical stream-based data-engineering pipeline, which is required in many real-world scenarios: data is streamed from an event streaming engine; the data is enriched, in real time, using data from a NoSQL table; the enriched data is saved to an output data stream and then consumed from this stream.
  • Smart stock trading (stocks) — reads stock-exchange data from an internet service into a time-series database (TSDB); uses Twitter to analyze the market sentiment on specific stocks, in real time; and saves the data to a platform NoSQL table that is used for generating reports and analyzing and visualizing the data on a Grafana dashboard.
  • Location-based recommendations (location-based-recommendations) — generates real-time product purchase recommendations for users of a credit-card company based on the users' physical location.
  • Real-time user segmentation (slots-stream) — builds a stream-event processor on a sliding time window for tagging and untagging users based on programmatic rules of user behavior.

Jupyter Notebook Basics

The platform's Jupyter Notebook service displays the JupyterLab UI, which consists of a collapsible left sidebar, a main work area (on the right), and a top menu bar. For details, see the JupyterLab documentation.

The main work area (on the right) contains tabs of documents and activities — for creating, viewing, editing, and running interactive notebooks, shell terminals, or consoles, as well as viewing and editing other common file types. To create a new notebook or terminal, select the New Launcher option (+ icon) from the top action toolbar in the left sidebar.

The top menu bar exposes available top-level actions, such as exporting a notebook in a different format.

The left-sidebar menu contains commonly used tabs, including a File Browser (directory icon) for browsing files.
The home directory of the platform's Jupyter Notebook service contains the following files and directories:

  • A v3io directory, which displays the contents of the v3io platform cluster data mount and lets you browse the contents of the cluster's data containers. You can also browse the contents of the data containers from the Data page of the platform dashboard.

  • The contents of the running-user home directory — users/<running user>. This directory contains the platform's tutorial Jupyter notebooks.

For information about the predefined data containers and how to reference data in these containers, see Platform Data Containers in the getting-started-basic tutorial notebook.
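For example, files in the data containers are accessible from notebook code through the v3io mount; the container and file path below are hypothetical placeholders.

import pandas as pd

# The users container is exposed under the /v3io mount; the file name here is only an example.
df = pd.read_csv('/v3io/users/admin/examples/my-data.csv')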

Creating Virtual Environments in Jupyter Notebook

A virtual environment is a named, isolated, working copy of Python that maintains its own files, directories, and paths so that you can work with specific versions of libraries or Python itself without affecting other Python projects. Virtual environments make it easy to cleanly separate projects and avoid problems with different dependencies and version requirements across components. See the virtual-env tutorial notebook for step-by-step instructions for using conda to create your own Python virtual environments, which will appear as custom kernels in Jupyter Notebook.
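For reference, the generic conda workflow looks roughly like the following notebook cell; the environment name and Python version are arbitrary examples, and the virtual-env tutorial notebook documents the platform-specific steps for making the environment appear as a Jupyter kernel.

# Create a new conda environment with its own Python and ipykernel (run from a notebook cell).
!conda create -y -n my-env python=3.7 ipykernel

# List the available environments to verify that the new environment was created.
!conda env list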

Updating the Tutorial Notebooks to the Latest Version

You can use the provided igz-tutorials-get.sh script to update the tutorial notebooks to the latest stable version available on GitHub. For details, see the update-tutorials.ipynb notebook.
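For example, assuming the script resides in your home directory (as in a default installation), you can run it from a notebook cell.

# Run the tutorials update script from the running user's home directory (path assumed).
!/User/igz-tutorials-get.sh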

Additional Resources

Platform Documentation, Examples, and Sample Data Sets

Third-Party Documentation, Examples, and Sample Data Sets

Support

The Iguazio support team will be happy to assist with any questions.
