โ— This repository is still experimental. No API-compatibility is guaranteed.

Rikai

Rikai is a Parquet-based ML data format built for working with unstructured data at scale. Processing large amounts of data for ML is never trivial, and that is especially true for the images and videos that are often at the core of deep learning applications. We are building Rikai with two main goals:

  1. Enable ML engineers/researchers to have a seamless workflow from Feature Engineering (Spark) to Training (PyTorch/TensorFlow), from notebook to production.
  2. Enable advanced analytics capabilities to support much faster active learning, model debugging, and monitoring in production pipelines.

Current (v0.0.1) main features:

  1. Native support in Spark and PyTorch for images/videos: reduce ad-hoc type conversions when moving between ETL and training.
  2. Custom functionality for working with images and videos at scale: reduce boilerplate and low-level code currently required to process images, filter/sample videos, etc.

Roadmap:

  1. TensorFlow integration
  2. Versioning support built into the dataset
  3. Richer video capabilities (ffmpeg-python integration)
  4. Declarative annotation API (think vega-lite for annotating images/videos)
  5. Data-centric analytics API (think BigQuery ML)

Example

from pyspark.ml.linalg import DenseMatrix
from rikai.types import Image, Box2d
from rikai import numpy as np

df = spark.createDataFrame(
    [{
        "id": 1,
        "mat": DenseMatrix(2, 2, range(4)),
        "image": Image("s3://foo/bar/1.png"),
        "annotations": [
            {
                "label": "cat",
                "mask": np.random(size=(256,256)),
                "bbox": Box2d(xmin=1.0, ymin=2.0, xmax=3.0, ymax=4.0)
            }
        ]
    }]
)

df.write.format("rikai").save("s3://path/to/features")

Train on the dataset in PyTorch

from rikai.torch import DataLoader

data_loader = DataLoader(
    "s3://path/to/features",
    batch_size=32,
    shuffle=True,
    num_workers=8,
)
for example in data_loader:
    print(example)
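
To make the batch_size and shuffle arguments above concrete, here is a minimal pure-Python sketch of shuffled batching. It does not use rikai and is not a description of how rikai.torch.DataLoader is implemented; iter_batches and its seed parameter are hypothetical names introduced only for illustration.

```python
import random
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(
    examples: Sequence[T],
    batch_size: int,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Iterator[List[T]]:
    """Yield lists of at most `batch_size` examples (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    indices = list(range(len(examples)))
    if shuffle:
        # A seeded generator keeps the shuffled order reproducible.
        random.Random(seed).shuffle(indices)
    # Walk the (possibly shuffled) indices in steps of batch_size.
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


batches = list(iter_batches(list(range(10)), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The last batch is shorter when the number of examples is not a multiple of batch_size; whether a real loader keeps or drops that remainder depends on the loader.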

Getting Started

Currently Rikai is maintained for Scala 2.12 and Python 3.7 and 3.8.

There are multiple ways to install Rikai:

  1. Try it using the included Dockerfile.
  2. OR install it via pip (pip install rikai), with extras for aws/gc, pytorch/tf, and others.
  3. OR install it from source

Note: if you want to use Rikai with your own pyspark installation, please consult the Rikai documentation for tips.

Docker

The included Dockerfile creates a standalone demo image with Jupyter, PyTorch, Spark, and rikai preinstalled, along with notebooks that let you try out the capabilities of the rikai feature store.

To build and run the docker image from the current directory:

# Clone the repo
git clone git@github.com:eto-ai/rikai rikai
# Build the docker image
docker build --tag rikai --network host .
# Run the image
docker run -p 0.0.0.0:8888:8888/tcp rikai:latest jupyter lab --ip 0.0.0.0 --port 8888

If successful, the console should then print out a clickable link to JupyterLab. You can also open a browser tab and go to localhost:8888.
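
If the link does not appear, it can help to check whether anything is listening on the published port at all. The following is a small sketch using only the Python standard library; port_is_open is a hypothetical helper name, and the check only confirms that a TCP connection succeeds, not that JupyterLab itself is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution errors.
        return False


if port_is_open("localhost", 8888):
    print("Something is listening on localhost:8888")
else:
    print("Nothing is listening on localhost:8888 yet")
```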

Install from pypi

The base rikai library can be installed with just pip install rikai. Dependencies for supporting pytorch (pytorch and torchvision), aws (boto), and jupyter (matplotlib and jupyterlab) are all part of optional extras. Many open-source datasets use YouTube videos, so we've also added pafy and youtube-dl as optional extras.

For example, if you want to use pytorch in Jupyter to train models on rikai datasets in S3 containing YouTube videos, you would run:

pip install rikai[pytorch,aws,jupyter,youtube]

If you're not sure what you need and don't mind installing some extra dependencies, you can simply install everything:

pip install rikai[all]
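
After installing, one way to see which optional dependencies actually landed in the environment is to probe for their import names. The sketch below uses only the standard library. The mapping from extras to module names is an assumption made for illustration (for example, boto3 for the aws extra and youtube_dl for youtube-dl) and is not taken from rikai's packaging metadata.

```python
from importlib.util import find_spec
from typing import Dict, List

# Assumed mapping from the extras named above to importable module names.
# These names are illustrative guesses, not read from rikai's setup files.
EXTRA_MODULES: Dict[str, List[str]] = {
    "pytorch": ["torch", "torchvision"],
    "aws": ["boto3"],
    "jupyter": ["matplotlib", "jupyterlab"],
    "youtube": ["pafy", "youtube_dl"],
}


def missing_modules(extras: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return, per extra, the top-level modules that cannot be found."""
    return {
        extra: [name for name in modules if find_spec(name) is None]
        for extra, modules in extras.items()
    }


for extra, missing in missing_modules(EXTRA_MODULES).items():
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{extra}: {status}")
```

An extra that reports missing modules can then be added with the matching pip install rikai[...] command.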

Install from source

To build from source you'll need Python as well as Scala with sbt installed:

# Clone the repo
git clone git@github.com:eto-ai/rikai rikai
# Build the jar
sbt publishLocal
# Install python package
cd python
pip install -e . # pip install -e .[all] to install all optional extras (see "Install from pypi")

Contributors

bobingm, changhiskhan, eddyxu, smellslikeml