
blockchain-etl / blockchain-etl-streaming


Streaming Ethereum and Bitcoin blockchain data to Google Pub/Sub or Postgres in Kubernetes

Home Page: https://medium.com/google-cloud/live-ethereum-and-bitcoin-data-in-google-bigquery-and-pub-sub-765b71cd57b5

License: MIT License

Languages: Shell 11.93%, Python 59.40%, Mustache 28.67%
Topics: real-time, bitcoin, ethereum, cryptocurrency, apache-beam, blockchain, blockchain-analytics, crypto, data-analytics, data-engineering

blockchain-etl-streaming's Introduction

Blockchain ETL Streaming

Streams the following Ethereum entities to Pub/Sub or Postgres using ethereum-etl stream:

  • blocks
  • transactions
  • logs
  • token_transfers
  • traces
  • contracts
  • tokens

Streams blocks and transactions to Pub/Sub using bitcoin-etl stream. Supported chains:

  • bitcoin
  • bitcoin_cash
  • dogecoin
  • litecoin
  • dash
  • zcash

Deployment Instructions

  1. Create a cluster:
gcloud container clusters create ethereum-etl-streaming \
--zone us-central1-a \
--num-nodes 1 \
--disk-size 10GB \
--machine-type custom-2-4096 \
--network default \
--subnetwork default \
--scopes pubsub,storage-rw,logging-write,monitoring-write,service-management,service-control,trace
  2. Get kubectl credentials:
gcloud container clusters get-credentials ethereum-etl-streaming \
--zone us-central1-a
  3. Create Pub/Sub topics (use create_pubsub_topics_ethereum.sh; a Python alternative is sketched after the topic list). Skip this step if you are streaming to Postgres instead:
  • "crypto_ethereum.blocks"
  • "crypto_ethereum.transactions"
  • "crypto_ethereum.token_transfers"
  • "crypto_ethereum.logs"
  • "crypto_ethereum.traces"
  • "crypto_ethereum.contracts"
  • "crypto_ethereum.tokens"
  4. Create a GCS bucket. Upload a text file containing the block number you want to start streaming from to gs://<YOUR_BUCKET_HERE>/ethereum-etl/streaming/last_synced_block.txt.
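
The same seeding can be done from Python with the google-cloud-storage client; a hedged sketch (the bucket name and starting block are placeholders):

# seed_last_synced_block.py - hypothetical helper for this step
from google.cloud import storage

BUCKET = "<YOUR_BUCKET_HERE>"   # assumption: your GCS bucket
START_BLOCK = 8000000           # assumption: block number to start streaming from

client = storage.Client()
blob = client.bucket(BUCKET).blob("ethereum-etl/streaming/last_synced_block.txt")
blob.upload_from_string(str(START_BLOCK))  # the file contains just the block number
print("Seeded gs://{}/{}".format(BUCKET, blob.name))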

  5. Create an "ethereum-etl-app" service account with the following roles:

    • Pub/Sub Editor
    • Storage Object Admin
    • Cloud SQL Client

Download the key. Create a Kubernetes secret:

kubectl create secret generic streaming-app-key --from-file=key.json=$HOME/Downloads/key.json -n eth
  6. Install helm (https://github.com/helm/helm#install):
brew install helm
helm init  
bash patch-tiller.sh
  7. Copy the example_values directory to a values directory and adjust all the files, at minimum setting your bucket and project ID.
  8. Install the ETL apps via helm using the chart from this repo and the values adjusted in the previous step, for example:
helm install --name btc --namespace btc charts/blockchain-etl-streaming --values values/bitcoin/bitcoin/values.yaml
helm install --name bch --namespace btc charts/blockchain-etl-streaming --values values/bitcoin/bitcoin_cash/values.yaml
helm install --name dash --namespace btc charts/blockchain-etl-streaming --values values/bitcoin/dash/values.yaml
helm install --name dogecoin --namespace btc charts/blockchain-etl-streaming --values values/bitcoin/dogecoin/values.yaml
helm install --name litecoin --namespace btc charts/blockchain-etl-streaming --values values/bitcoin/litecoin/values.yaml
helm install --name zcash --namespace btc charts/blockchain-etl-streaming --values values/bitcoin/zcash/values.yaml

helm install --name eth-blocks --namespace eth charts/blockchain-etl-streaming \
--values values/ethereum/values.yaml --values values/ethereum/block_data/values.yaml
helm install --name eth-traces --namespace eth charts/blockchain-etl-streaming \
--values values/ethereum/values.yaml --values values/ethereum/trace_data/values.yaml

Ethereum block and trace data streaming are decoupled for higher reliability.

To stream to Postgres:

helm install --name eth-postgres --namespace eth charts/blockchain-etl-streaming \
--values values/ethereum/values-postgres.yaml

Refer to https://github.com/blockchain-etl/ethereum-etl-postgres for table schema and initial data load.

  9. Use the describe command to troubleshoot, e.g.:
kubectl describe pods -n btc
kubectl describe node [NODE_NAME]

Refer to blockchain-etl-dataflow for connecting Pub/Sub to BigQuery.

blockchain-etl-streaming's People

Contributors

allenday, medvedev1088, ninjascant, voron


blockchain-etl-streaming's Issues

Implement Kubernetes manifests for ethereum-etl-streaming

A single Pod runs:

  1. download-last_synced_block - Job that downloads last_synced_block.txt from the GCS bucket when the Pod starts up. See also init containers.
  2. streaming_service.py - streams blockchain data to PubSub
  3. health_checker.py - listens on HTTP port for health check requests
  4. gcs-syncd - periodically uploads last_synced_block.txt to the GCS bucket and has a preStop hook that uploads the file one last time before the Pod is terminated.
  5. fluentd daemon - uploads logs to GCS bucket

====================

Some code can be found here https://github.com/airswap/ethereum-etl/commit/cc7f86e139d317452c9cf0c975b99feb6ed8d7a0#diff-aa0ae5de65a67b108e5a26ddde9d3adf

minor - gcs_prefix inconsistent with readme.md

GCS_PREFIX in example_values/ethereum/trace_data/values.yaml is inconsistent with readme step 4 (ethereum-etl/streaming/last_synced_block.txt).

GCS_PREFIX should read ethereum-etl/streaming

Investigate PubSub push subscription error: The supplied HTTP URL is not registered in the project that owns the subscription

There is no way to connect a cloud function with a PubSub in a different project:

Note: The Cloud Pub/Sub topic that your function is subscribed to must be in the same Google Cloud Platform project as your Cloud Function.

There is a way to use HTTP-triggered subscription to invoke a cloud function:

Note: You can also use HTTP-triggered functions to listen to Pub/Sub push subscriptions. This allows a single Cloud Function to subscribe to multiple topics.

However, when trying to create an HTTP-triggered subscription, an error occurs:

PubSub push subscription error: The supplied HTTP URL is not registered in the project that owns the subscription

It seems there is a need to register the endpoint: https://cloud.google.com/pubsub/docs/push#other-endpoints

This task is to follow the registration flow and see how simple/hard it is.

Implement streaming_service.py

Streaming service

  1. Continuously polls for new blocks using the Ethereum JSON-RPC API (e.g. every 10 seconds).
  2. Outputs data to GooglePubSubItemExporter, which publishes the data to PubSub.
  3. Saves the last synchronized block to the last_sync_block.txt file every polling period (a minimal sketch follows at the end of this issue).

This script can be used as a basis https://github.com/blockchain-etl/ethereum-etl/blob/feature/streaming/stream.py

================================

PubSub publisher guide https://cloud.google.com/pubsub/docs/publisher

There is some code in this fork https://github.com/airswap/ethereum-etl/commit/cc7f86e139d317452c9cf0c975b99feb6ed8d7a0#diff-a8e6a25bc32464a59378bac1befac9eb

TODO: Consider using geth subscriptions to handle chain reorganizations https://github.com/ethereum/go-ethereum/wiki/RPC-PUB-SUB#newheads
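
A minimal sketch of such a polling loop is below. It is not the project's actual implementation: the RPC URL, project ID and topic name are placeholders, the google-cloud-pubsub and requests packages are assumed, and entity extraction and reorg handling are omitted.

# streaming_service.py sketch - hypothetical skeleton of the polling/publishing loop
import json
import time

import requests
from google.cloud import pubsub_v1

RPC_URL = "http://localhost:8545"    # assumption: Ethereum JSON-RPC endpoint
PROJECT_ID = "your-gcp-project"      # assumption
TOPIC = "crypto_ethereum.blocks"     # assumption
LAST_SYNCED_FILE = "last_sync_block.txt"
POLL_SECONDS = 10

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC)

def latest_block_number():
    payload = {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    return int(requests.post(RPC_URL, json=payload).json()["result"], 16)

def read_last_synced():
    try:
        with open(LAST_SYNCED_FILE) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return latest_block_number()  # no checkpoint yet: start from the chain head

last_synced = read_last_synced()
while True:
    head = latest_block_number()
    for block_number in range(last_synced + 1, head + 1):
        # The real service exports full block/transaction/log entities here.
        message = json.dumps({"block_number": block_number}).encode("utf-8")
        publisher.publish(topic_path, data=message).result()
    if head > last_synced:
        last_synced = head
        with open(LAST_SYNCED_FILE, "w") as f:
            f.write(str(last_synced))  # checkpoint read by gcs-syncd and health_checker.py
    time.sleep(POLL_SECONDS)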

Add AWS compatibility

Is it possible to add S3 storage or a Kinesis data pipeline as an output or as storage for the current state?

Implement health_checker.py

Health checker:

  1. Listens on an HTTP port for health check requests.
  2. Checks the last_sync_block.txt file produced by streaming_service.py and looks at its last-modified date.
  3. Reports UNHEALTHY if last_sync_block.txt was modified more than 2 minutes ago.
  4. Has minimal dependencies.
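
A minimal standard-library sketch along these lines (the file path and port are assumptions; the 2-minute threshold comes from point 3 above):

# health_checker.py sketch - hypothetical; reports UNHEALTHY when the checkpoint file is stale
import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

LAST_SYNCED_FILE = "last_sync_block.txt"  # assumption: written by streaming_service.py
MAX_AGE_SECONDS = 120                     # 2 minutes, per point 3 above
PORT = 8000                               # assumption

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            age = time.time() - os.path.getmtime(LAST_SYNCED_FILE)
            healthy = age <= MAX_AGE_SECONDS
        except OSError:
            healthy = False
        self.send_response(200 if healthy else 500)
        self.end_headers()
        self.wfile.write(b"HEALTHY" if healthy else b"UNHEALTHY")

if __name__ == "__main__":
    HTTPServer(("", PORT), HealthHandler).serve_forever()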
