
Serverless data pipeline with Cloud Functions, Pub/Sub and BigQuery on GCP

This project aims to show how to implement a simple data pipeline on GCP using some of its serverless services: Cloud Functions, Pub/Sub, Cloud Scheduler, and BigQuery.

Introduction

The pipeline consists of a process that regularly gets data from an API and loads it into BigQuery. Considering its popularity, the current weather data API by OpenWeatherMap was chosen to exemplify the data gathering stage.

Reference architecture

The following image shows the reference architecture for this project.

Architecture

About the pipeline

The process can be summarized in the following steps:

  1. At the configured frequency, a Cloud Scheduler job publishes a message to a Cloud Pub/Sub topic.
  2. That message triggers a Cloud Function (loadDataIntoBigQuery) that fetches data from OpenWeatherMap (a minimal sketch of such a function follows this list).
  3. The function then loads this data into BigQuery.
  4. Finally, the data can be analyzed directly in BigQuery or visualized with Data Studio.
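For reference, here is a minimal sketch of what a function like loadDataIntoBigQuery could look like on the nodejs10 runtime. The city, the row fields, and the node-fetch dependency are illustrative assumptions, not necessarily what this repository's index.js does:

const fetch = require('node-fetch');
const { BigQuery } = require('@google-cloud/bigquery');

const bigquery = new BigQuery();

// Pub/Sub-triggered background function; the exported name must match FUNCTION_NAME.
exports.loadDataIntoBigQuery = async (event, context) => {
  // Fetch current weather for an illustrative city.
  const url = 'https://api.openweathermap.org/data/2.5/weather' +
    `?q=Santiago&units=metric&appid=${process.env.OPEN_WEATHER_MAP_API_KEY}`;
  const response = await fetch(url);
  const weather = await response.json();

  // Flatten the fields of interest into a single row (hypothetical schema).
  const row = {
    city: weather.name,
    temp: weather.main.temp,
    humidity: weather.main.humidity,
    timestamp: new Date(weather.dt * 1000).toISOString(),
  };

  // Stream the row into the table configured via environment variables.
  await bigquery
    .dataset(process.env.BQ_DATASET)
    .table(process.env.BQ_TABLE)
    .insert([row]);
};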

System requirements

The following is needed to deploy the services:

  1. A GCP project with a linked billing account
  2. The Google Cloud SDK, installed and initialized
  3. An App Engine app created in your project. Why?
  4. The Cloud Functions, Cloud Scheduler, and App Engine APIs enabled (see the commands after this list)
  5. An API key from OpenWeatherMap
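For requirements 3 and 4, assuming the Cloud SDK is already initialized and pointing at your project, the following commands should do the job (the App Engine region is illustrative; pick the one that suits you):

gcloud app create --region=us-central

gcloud services enable cloudfunctions.googleapis.com cloudscheduler.googleapis.com appengine.googleapis.com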

Costs

This pipeline uses billable components of Google Cloud Platform, including:

  • Google Cloud Functions
  • Google Cloud Pub/Sub
  • Google Cloud Scheduler
  • Google BigQuery

Deployment

This section shows you how to deploy all the services needed to run the pipeline.

Setting up environment variables

Before continuing, it is preferable to set up some environment variables that will help you execute the gcloud commands smoothly.

export PROJECT_ID=<Your_Project_Id>

# The Pub/Sub topic name.
export TOPIC_NAME=<Your_Pub_Sub_Topic>

# It must be unique in the project. Note that you cannot re-use a job name in a project even if you delete its associated job.
export JOB_NAME=<Your_Cron_Scheduler_Job_Name>

# The name of the function corresponds to the exported function name in index.js.
export FUNCTION_NAME="loadDataIntoBigQuery"

# E.g., for an execution frequency of 1 hour, set SCHEDULE_TIME="every 1 hour".
export SCHEDULE_TIME=<Your_Cron_Schedule>

# OpenWeatherMap API key
export OPEN_WEATHER_MAP_API_KEY=<Your_Open_Weather_Map_Api_Key>

# Consider that dataset names must be unique per project. Dataset IDs must be alphanumeric (plus underscores).
export BQ_DATASET=<Your_BQ_Dataset_Name>

# The table name must be unique per dataset.
export BQ_TABLE=<Your_BQ_Table_Name>

1. Activate the project

gcloud config set project $PROJECT_ID

2. Create the Cloud Pub/Sub topic

gcloud pubsub topics create $TOPIC_NAME

3. Create the Cloud Scheduler job

gcloud scheduler jobs create pubsub $JOB_NAME --schedule="$SCHEDULE_TIME" --topic=$TOPIC_NAME --message-body="execute"

If you later want to change the execution frequency, the following command will help:

gcloud scheduler jobs update pubsub $JOB_NAME --schedule="$SCHEDULE_TIME"
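Cloud Scheduler also accepts standard unix-cron expressions. For instance, this illustrative update sets the job to run at the start of every hour:

gcloud scheduler jobs update pubsub $JOB_NAME --schedule="0 * * * *"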

4. Create a BigQuery dataset

bq mk $BQ_DATASET

5. Create a BigQuery table

bq mk --table $PROJECT_ID:$BQ_DATASET.$BQ_TABLE
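Note that the command above creates a table without a schema. If you prefer to define one up front, bq mk accepts an inline schema; the fields below are just an illustrative match for the weather data (adjust them to whatever your function actually loads):

bq mk --table $PROJECT_ID:$BQ_DATASET.$BQ_TABLE city:STRING,temp:FLOAT,humidity:INTEGER,timestamp:TIMESTAMP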

6. Deploy the Cloud Function

gcloud functions deploy $FUNCTION_NAME --trigger-topic $TOPIC_NAME --runtime nodejs10 --set-env-vars OPEN_WEATHER_MAP_API_KEY=$OPEN_WEATHER_MAP_API_KEY,BQ_DATASET=$BQ_DATASET,BQ_TABLE=$BQ_TABLE
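Once everything is deployed, you do not have to wait for the schedule to verify the pipeline; you can force a run of the Scheduler job and then inspect the function's logs:

gcloud scheduler jobs run $JOB_NAME

gcloud functions logs read $FUNCTION_NAME --limit=20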

What now

I want to write this section only as an opinion and share some ideas on how to finish this pipeline like a true queen or king of data.

Also, consider that this particular stage depends entirely on the data or insights you want to obtain. Felipe Hoffa illustrates different use cases and ideas using BigQuery; you should read him on Medium!

Query your table

Here are two options (there are clearly more).

First, remember the environment variables? They are still useful. If you run the next command, BigQuery will execute a job consisting of a query that counts all the records in your table. If you completed the steps above correctly, you will see at least one record counted.

bq query --nouse_legacy_sql "SELECT COUNT(*) FROM $BQ_DATASET.$BQ_TABLE"

Second, BigQuery in the GCP Console is also an enjoyable way to explore and analyze your data.
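And if your table uses a schema like the illustrative one above, a slightly more interesting query could look like this:

bq query --nouse_legacy_sql "SELECT city, AVG(temp) AS avg_temp FROM $BQ_DATASET.$BQ_TABLE GROUP BY city"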

Data Studio, the great finale

Google's technological ecosystem grows rapidly day by day. This project is a small but concise proof of how complete an end-to-end data solution built in this ecosystem can be.

Just to try it (you should too), I built a report on Data Studio, and it was a great and fast experience. In my opinion, the analytical power of BigQuery combined with its report/dashboard tool is the perfect duo for small and big data processes. Look at this report: just 20-30 minutes of learning by doing, connected directly to BigQuery!

Data Studio

This is not propaganda; Google didn't pay me for this (unfortunately).


Authors

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

Further reading
