Lakehouse Playground

Supported Data Pipeline Components

| Pipeline Component | Version | Description |
|--------------------|---------|-------------|
| Trino | 425+ | Query Engine |
| DBT | 1.5+ | Analytics Framework |
| Spark | 3.3+ | Computing Engine |
| Flink | 1.16+ | Computing Engine |
| Iceberg | 1.3.1+ | Table Format (Lakehouse) |
| Hudi | 0.13.1+ | Table Format (Lakehouse) |
| Airflow | 2.7+ | Scheduler |
| Jupyterlab | 3+ | Notebook |
| Kafka | 3.4+ | Messaging Broker |
| Debezium | 2.3+ | CDC Connector |

Getting Started

Start the compose containers first.

# Use `COMPOSE_PROFILES` to select the profile
COMPOSE_PROFILES=trino docker-compose up;
COMPOSE_PROFILES=spark docker-compose up;
COMPOSE_PROFILES=flink docker-compose up;
COMPOSE_PROFILES=airflow docker-compose up;

# Combine multiple profiles
COMPOSE_PROFILES=trino,spark docker-compose up;

# for CDC environment (Kafka, ZK, Debezium)
make compose.clean compose.cdc

# for Stream environment (Kafka, ZK, Debezium + Flink)
make compose.clean compose.stream
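
When several profiles are needed, the comma-separated value can be assembled by a small helper. The following is a minimal sketch, not part of this project: `join_profiles` and `compose_up_cmd` are hypothetical names, and the wrapper only prints the command so it can be checked before any containers start.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the project): joins profile names
# into the comma-separated form that COMPOSE_PROFILES expects.
join_profiles() {
  local IFS=,
  echo "$*"
}

# Hypothetical wrapper: prints the command instead of running it.
compose_up_cmd() {
  echo "COMPOSE_PROFILES=$(join_profiles "$@") docker-compose up"
}

# Show the command for three profiles at once.
compose_up_cmd trino spark flink
```

To actually start the containers, run the printed line in the directory that holds the compose file.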

Then access the lakehouse services.


CDC Starter kit

# Run cdc-related containers
make compose.cdc;

# Register debezium mysql connector using Avro Schema Registry
make debezium.register.customers;

# Register debezium mysql connector using JSON Format
make debezium.register.products;
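
The contents of the `make debezium.register.*` targets are not shown here. Registering a Debezium connector usually means sending a JSON definition to the Kafka Connect REST API, so the sketch below illustrates that general shape for the JSON format. Every concrete value in it is an assumption: the Connect URL, host names, credentials, server id, database name, and topic names are placeholders, and `connector_payload` and `register_connector` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Assumed values: the Kafka Connect URL, host names, credentials and
# topic names below are placeholders, not taken from this project.
CONNECT_URL="${CONNECT_URL:-http://localhost:8083}"

# Print a Debezium MySQL connector definition that uses the JSON converter.
connector_payload() {
  local name="$1" database="$2"
  cat <<EOF
{
  "name": "${name}",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "debezium",
    "database.server.id": "184054",
    "topic.prefix": "${name}",
    "database.include.list": "${database}",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-history.${name}",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter"
  }
}
EOF
}

# POST the definition to Kafka Connect (needs curl and a running worker).
register_connector() {
  connector_payload "$1" "$2" |
    curl -sS -X POST -H "Content-Type: application/json" \
      --data @- "${CONNECT_URL}/connectors"
}

# Inspect the payload without contacting any service.
connector_payload products inventory
```

Calling `register_connector products inventory` would submit the same payload. For the Avro variant, the converters would typically be swapped for an Avro converter that points at the schema registry.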

Running Flink Applications

Flink supports Java 11, but this project uses Java 8 because of the Flink SQL (Hive) dependency. The Flink SQL application in this project is written in Kotlin for SQL readability.

You can run it in IDEA with an Application run configuration (not a Kotlin one). The dependencies a Flink application requires are already included in the production Docker image or EMR cluster.

They are therefore declared with "provided" scope in the Maven project. To run the application locally, enable the IDEA option 'Add dependencies with "provided" scope to classpath', as shown in the screenshot below.

Once the local Flink application is running, you can access the Flink Job Manager UI at localhost:8081.
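
To confirm from a terminal that the local job manager is up, one option is to poll Flink's REST `/overview` endpoint, which is served on the same port as the UI. This is a minimal sketch: `flink_overview_url` and `wait_for_flink` are hypothetical helpers, and the retry count is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Build the URL of Flink's REST "overview" endpoint.
# Host and port default to the localhost:8081 address of the local UI.
flink_overview_url() {
  echo "http://${1:-localhost}:${2:-8081}/overview"
}

# Hypothetical helper: poll the endpoint once per second until it
# answers or the attempts run out (needs curl).
wait_for_flink() {
  local attempts="${1:-30}" url
  url="$(flink_overview_url)"
  while [ "${attempts}" -gt 0 ]; do
    curl -sf "${url}" >/dev/null && return 0
    attempts=$((attempts - 1))
    sleep 1
  done
  return 1
}

# Print the URL that wait_for_flink would poll.
flink_overview_url
```

Running `wait_for_flink 10 && echo ready` after starting the application in IDEA waits up to about ten seconds for the endpoint to respond.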

[Screenshot: IDEA run configuration]

DBT Starter kit

# Run trino-related containers
make compose.dbt;

# Prepare iceberg schema
make trino-cli;
$ create schema iceberg.staging WITH ( LOCATION = 's3://datalake/staging' );
$ create schema iceberg.mart WITH ( LOCATION = 's3://datalake/mart' );

# Execute dbt commands locally
cd dbts;
dbt deps;
dbt run;
dbt test;
dbt docs generate && dbt docs serve --port 8070; # http://localhost:8070

# Select dbt-created tables from trino-cli
make trino-cli;
$ SELECT * FROM iceberg.mart.aggr_location LIMIT 10;
$ SELECT * FROM iceberg.staging.int_location LIMIT 10;
$ SELECT * FROM iceberg.staging.stg_nations LIMIT 10;
$ SELECT * FROM iceberg.staging.stg_regions LIMIT 10;

# Execute airflow dags for dbt
make airflow.shell;
airflow dags backfill dag_dbt --local --reset-dagruns -s 2022-09-02 -e 2022-09-03;
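
The dbt commands above are separated by `;`, so a later step still runs when an earlier one fails. If stopping at the first failure is preferred, a small runner along these lines could be used. It is only a sketch, and `run_steps` is a hypothetical helper, not something this project provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run each step in order and stop at the first failure.
run_steps() {
  local step
  for step in "$@"; do
    echo "==> ${step}"
    # Word splitting is intentional: each step is a simple command line.
    ${step} || { echo "step failed: ${step}" >&2; return 1; }
  done
}

# Example (run inside the dbts directory):
# run_steps "dbt deps" "dbt run" "dbt test"
```

With this runner, `dbt test` is skipped when `dbt run` exits with a non-zero status. Steps that need quoting or pipes would require a different approach, since each step is split on whitespace.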

Screenshots

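Flink Job Manager UI

[Screenshot: Flink Job Manager UI]

Kafka UI

[Screenshot: Kafka UI]

Minio UI

[Screenshot: Minio UI]

Running Local Flink Application in IDEA

[Screenshot: local Flink application running in IDEA]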

Contributors

1ambda
