
ApacheBeam_GCPDataflow_Demo

This batch and streaming data processing demo was tested on GCP Dataflow, an Apache Beam execution environment.

This demo is intended to show the procedure of developing with Apache Beam rather than any specific data processing logic. The code was distilled from a real project, so the core processing logic has been deliberately removed to avoid leaking business information.

Why Beam? Beam=Batch+strEAM

Apache Beam (https://beam.apache.org/): An advanced unified programming model

Implement batch and streaming data processing jobs that run on any execution engine.

Apache Beam supports distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.

An Apache Beam application can therefore be deployed on Google Cloud Dataflow as a SERVERLESS solution.

An Apache Beam batch application can also run in Colab.

Step 1: Imperative code to process data

Apache Beam is a functional-style processing platform, which makes Beam code hard to debug. I suggest starting with imperative code. If you choose Python as your working language, a Jupyter Notebook is an easy way to start.
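For illustration, a minimal imperative prototype might look like the sketch below. The event fields, sample data, and validation rule are hypothetical, since the core processing logic has been removed from this repo.

```python
# A hypothetical imperative prototype: parse order events and drop invalid
# sizes, before introducing Beam at all.
import csv
import io

SAMPLE = """symbol,side,size,price
ABC,bid,100,10.5
ABC,ask,-20,10.7
XYZ,bid,50,3.2
"""

def parse(row):
    # Convert raw CSV strings into typed values.
    return {
        "symbol": row["symbol"],
        "side": row["side"],
        "size": int(row["size"]),
        "price": float(row["price"]),
    }

def is_valid(event):
    # Basic sanity check: bid/ask size cannot be negative.
    return event["size"] >= 0

events = [parse(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
print([e for e in events if is_valid(e)])
```

Once the imperative version behaves as expected, each step maps naturally onto a Beam transform (Map, Filter, and so on).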

Step 2: Batch implementation

To provide a serverless solution, this step was implemented as a notebook. The notebook was developed and tested on Colab (https://colab.research.google.com/) with the Apache Beam DirectRunner.

Colab may raise some errors during the Apache Beam installation. Please ignore them; they do not affect the subsequent steps.
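For reference, here is a minimal sketch of what the batch pipeline could look like in a Colab cell with the DirectRunner; the file name and fields are placeholders, not the actual notebook code.

```python
# Install Beam in a Colab cell first, e.g.: !pip install apache-beam
import apache_beam as beam

def parse(line):
    # Hypothetical CSV layout: symbol,side,size,price
    symbol, side, size, price = line.split(",")
    return {"symbol": symbol, "side": side, "size": int(size), "price": float(price)}

# DirectRunner is Beam's default, so no runner option is needed in Colab.
with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("orders.csv", skip_header_lines=1)
        | "Parse" >> beam.Map(parse)
        | "DropNegativeSizes" >> beam.Filter(lambda e: e["size"] >= 0)
        | "Print" >> beam.Map(print)
    )
```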

Step 3: Streaming Implementation

The same logic, in a standalone file, Streaming.py, can be run in streaming mode. The code supports Google Pub/Sub as the streaming source.

The streaming implementation was tested on the Google Cloud Dataflow platform.

Running the Google Cloud environment will cost a few dollars.
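A minimal sketch of what the Pub/Sub read in Streaming.py might look like; the project and topic names are placeholders, and the actual file may differ.

```python
# Requires the GCP extras: pip install "apache-beam[gcp]"
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runner, project, region, etc. are taken from the command line;
# streaming mode is forced here.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/YOUR_PROJECT/topics/YOUR_TOPIC")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "Print" >> beam.Map(print)
    )
```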

The Beam environment

The Beam environment can be set up by following this tutorial (https://console.cloud.google.com/getting-started?walkthrough_tutorial_id=python_dataflow_quickstart&_ga=2.25278013.1291490325.1607857566-321077206.1587778923 )

The streaming source

The streaming source can be simulated with the Google Dataflow template Text Files on Cloud Storage to Pub/Sub (Stream) (https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#gcstexttocloudpubsubstream).

Command to run

After configuration (streaming mode, topic name, access control, etc.), run the pipeline with:

python3 Streaming.py
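The exact flags depend on how Streaming.py defines its options. With standard Beam pipeline options, a Dataflow invocation would typically look like the commented example in this sketch (project, region, and bucket are placeholders):

```python
# Typical invocation with standard Beam/Dataflow pipeline options, e.g.:
#   python3 Streaming.py \
#       --runner DataflowRunner \
#       --project YOUR_PROJECT \
#       --region us-central1 \
#       --temp_location gs://YOUR_BUCKET/tmp \
#       --streaming
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()  # with no explicit flags, this parses sys.argv
options.view_as(StandardOptions).streaming = True  # ensure streaming mode
print(options.get_all_options(drop_default=True))  # inspect the parsed options
```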

Things to consider

How would you deal with data that arrives out of order?

So far, the implementation only covers basic fault tolerance, such as checking that bid/ask sizes cannot be negative, even though an order size may legitimately appear negative when data arrives out of order.

For serious out-of-order situations, the data can be recorded and replayed. To save cost, the history can be replayed PER order.

In streaming mode, some sources, such as Google Pub/Sub, can be polled to replay the history.
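The standard Beam answer to out-of-order data is event-time windowing with watermarks and allowed lateness. Below is a minimal sketch, not taken from this repo, using hypothetical (symbol, size, timestamp) events.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (AccumulationMode, AfterProcessingTime,
                                            AfterWatermark)
from apache_beam.transforms.window import TimestampedValue

# Hypothetical events: (symbol, size, event time in seconds since epoch).
raw = [("ABC", 100, 10), ("ABC", 50, 70), ("XYZ", 30, 15)]

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(raw)
        | "Stamp" >> beam.Map(lambda e: TimestampedValue((e[0], e[1]), e[2]))
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                         # 1-minute event-time windows
            trigger=AfterWatermark(late=AfterProcessingTime(30)),
            allowed_lateness=600,                            # accept up to 10 minutes of late data
            accumulation_mode=AccumulationMode.ACCUMULATING)
        | "SumSizePerSymbol" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```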

Architecture

Because the order book is shared in memory, the implementation is not scalable.

Even STATEFUL Apache Beam processing does not provide this scalability, since Beam state is partitioned per key rather than shared across workers.

To be scalable, the shared order book should be moved into an external persistent database, such as Google Bigtable.

It depends on the business requirements: if one node can process the data in time and the history can be quickly replayed to rebuild the order book, a single-node model can be acceptable.
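For reference, Beam's stateful processing keeps state per key (and per window), so it only helps when the order book can be partitioned, e.g. per symbol. A minimal hypothetical sketch:

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class TrackDepth(beam.DoFn):
    """Keeps a running total of order sizes per key (e.g. per symbol).

    The state is partitioned by key, so a truly shared order book would
    still need an external store such as Google Bigtable.
    """
    DEPTH = ReadModifyWriteStateSpec("depth", VarIntCoder())

    def process(self, element, depth=beam.DoFn.StateParam(DEPTH)):
        symbol, size = element
        total = (depth.read() or 0) + size
        depth.write(total)
        yield symbol, total

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("ABC", 100), ("ABC", 50), ("XYZ", 30)])
        | beam.ParDo(TrackDepth())
        | beam.Map(print)
    )
```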
