Lakehouse

A Docker Compose stack that runs an open-source lakehouse on a single machine using Prefect, Iceberg, Trino and Superset.

Prerequisites

To run the stack, you need a Linux, Windows or macOS computer with the following dependencies installed:

  • docker
  • docker-compose
  • bash
  • wget

Notes:

  • Minimum Docker resources: 4 cores and 8 GB of RAM.
  • Windows deployment has not been tested, but it should work with WSL and Docker Desktop.
  • The following ports must be available: 80, 4200, 5006, 8088.

Getting Started

  • Clone this repository and cd into it.
  • Use the lakehouse.sh bash script to start and stop the stack (see the example after this list). Options:
    • lakehouse.sh start: initialize the environment (if needed) and start all services.
    • lakehouse.sh stop: stop all services.
    • lakehouse.sh restart: restart all services.
    • lakehouse.sh status: display the status of all services.
    • lakehouse.sh reset: reset the environment. All ingested data will be deleted.
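
For example, to bring the stack up for the first time and check that all services are running:

./lakehouse.sh start
./lakehouse.sh status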

Environment Configuration and Initialization

The environment is configured by default to run on Docker Desktop. When running on Linux, edit the .env file and set the HOST_OR_IP variable to the machine's IP address or DNS name.

Example:

# HOST_OR_IP=host.docker.internal
HOST_OR_IP=10.10.10.10

No other configuration changes are needed. lakehouse.sh start initializes all services, including database setup and Prefect deployment registration.

Initialization runs automatically the first time the environment is started, and again the first time it is started after an environment reset.

Services

The Lakehouse stack contains the following services:

  • traefik: reverse proxy.
  • iceberg: Iceberg metadata catalog.
  • minio: S3-compatible object storage for Iceberg table data.
  • trino: query engine.
  • superset: data exploration and visualization.
  • jupyter: notebook server configured to access Iceberg and Trino.
  • prefect-server: workflow engine server used for data ingestion.
  • prefect-worker: workflow engine worker.
  • postgres: SQL database used by Prefect, Superset and the Iceberg metadata catalog.

Some of the services provide web user interfaces. They are exposed through the Traefik reverse proxy and can be reached at http://jupyter.lakehouse.localhost, http://prefect.lakehouse.localhost, http://superset.lakehouse.localhost, http://trino.lakehouse.localhost and http://minio.lakehouse.localhost after adding the following entry to your /etc/hosts:

# /etc/hosts
127.0.0.1 jupyter.lakehouse.localhost prefect.lakehouse.localhost superset.lakehouse.localhost trino.lakehouse.localhost minio.lakehouse.localhost

Data Ingestion

When the environment is initialized, the Prefect flow data-to-dashboard is automatically registered. This flow ingests CSV or Parquet files using DuckDB and creates a simple dashboard for the ingested data.
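
For orientation, the core of such an ingestion step could look like the sketch below. This is not the flow's actual code: the file path, catalog name, endpoints, namespace and table name are all illustrative assumptions.

# Minimal ingestion sketch (not the flow's actual code); the catalog name,
# endpoints and identifiers below are assumptions.
import duckdb
from pyiceberg.catalog import load_catalog

# DuckDB reads csv and parquet files with the same one-liner.
rows = duckdb.sql("SELECT * FROM '/lakehouse-poc/datasets/sample.parquet'").arrow()

catalog = load_catalog(
    "lakehouse",                     # hypothetical catalog name
    type="rest",
    uri="http://iceberg:8181",       # assumed catalog endpoint
)
catalog.create_namespace("demo")     # the "database name" is an Iceberg namespace
table = catalog.create_table("demo.sample", schema=rows.schema)
table.append(rows)                   # commit the rows as a new table snapshot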

Ingesting sample data with Prefect

A sample dataset is provided and can be ingested with the following steps:

  • Navigate to Prefect's UI -> Deployments.
  • Click the three dots to the right of the data-to-dashboard:dev deployment.
  • Select Quick run.
  • After the flow run starts, you can navigate to the Flow Runs section and follow the flow's logs.
  • Once the flow run has completed successfully, navigate to the Superset UI. A dashboard with a sample table chart should be available.

Ingesting sample data with Jupyter

A sample Jupyter notebook is provided with code to ingest and query a Parquet file (a standalone query sketch follows the steps below):

  • Navigate to Jupyter's UI.
  • Open the notebook `notebooks/lakehouse.ipynb`.
  • Run the notebook.
  • Once the notebook has run successfully, navigate to the Superset UI. A dashboard with a sample table chart should be available.
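
The ingested tables can also be queried outside the notebook through Trino. Below is a minimal sketch using the trino Python client; the hostname, user, namespace and table name are assumptions:

import trino

# Connect to Trino through the Traefik proxy from the host machine.
conn = trino.dbapi.connect(
    host="trino.lakehouse.localhost",  # assumes the /etc/hosts entries above
    port=80,
    user="lakehouse",                  # hypothetical user name
    catalog="iceberg",
    schema="demo",                     # hypothetical Iceberg namespace
)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM sample")  # hypothetical table name
print(cur.fetchone())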

Ingesting your own data with Prefect

Custom CSV or Parquet files can be ingested with Prefect:

  • Copy the data file into the data/datasets folder.
  • Navigate to Prefect's UI -> Deployments.
  • Click the three dots to the right of the data-to-dashboard:dev deployment.
  • Select Custom run.
  • Set the URL for your file. It must start with /lakehouse-poc/datasets/ (see the example parameter values after this list).
  • Set the database name (Iceberg namespace) and the table name.
  • All other parameters are optional and apply to CSV files only.
  • Click Submit.
  • After the flow run starts, you can navigate to the Flow Runs section and follow the flow's logs.
  • Once the flow run has completed successfully, navigate to the Superset UI. A dashboard with a sample table chart should be available. Note that a datetime column is required for the dashboard to work.
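
For example, for a file copied to data/datasets/my_data.csv, the run parameters might look like this (the file, database and table names are placeholders):

url: /lakehouse-poc/datasets/my_data.csv
database: demo
table: my_data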

Future Updates

  • Add Spark for ingestion and query.
  • Add an option to use Nessie as the Iceberg catalog. Currently waiting for pyiceberg to add support for it.
