Giter VIP home page Giter VIP logo

circus-train-bigquery's Introduction

Circus Train BigQuery To Hive Replication

Overview

This Circus Train plugin enables the conversion of BigQuery tables to Hive.

Start using

You can obtain Circus Train BigQuery from Maven Central:

Maven Central Build Status build Coverage Status GitHub license

Installation

In order to be used by Circus Train the above circus-train-bigquery jar file must be added to Circus Train's classpath. It is highly recommended that the version of this library and the version of Circus Train are identical. The recommended way to make this extension available on the classpath is to store it in a standard location and then add this to the CIRCUS_TRAIN_CLASSPATH environment variable (e.g. via a startup script):

export CIRCUS_TRAIN_CLASSPATH=$CIRCUS_TRAIN_CLASSPATH:/opt/circus-train-big-query/lib/*

Another option is to place the jar file in the Circus Train lib folder which will then automatically load it but risks interfering with any Circus Train jobs that do not require the extension's functionality.

Configuration

  • Add the following to the Circus Train YAML configuration in order to load the BigQuery extension via Circus Train's extension loading mechanism:
     extension-packages: com.hotels.bdp.circustrain.bigquery
  • Configure Circus Train as you would for a copy job from Google Cloud Configuration
  • Provide the Google Cloud project ID that your BigQuery instance resides in as your source-catalog hive-metastore-uris parameter using the format hive-metastore-uris: bigquery://<project-id>
  • To enable copying to Google Storage provide a path to your Google Credentials in the configuration under the gcp-security parameter.
  • Provide your BigQuery dataset as source-table database-name and your BigQuery table name as source-table table-name

Partition Generation

Circus Train BigQuery allows you to add partitions for one column to your Hive destination table upon replication from BigQuery to Hive. The partition must be a field present on your source BigQuery table. The user can configure the partition field by setting the table-replications[n].copier-option : circus-train-bigquery-partition-by property on a specific table replication within the Circus Train configuration file. The destination data will be repartitioned on the specified field.

If your destination data is partitioned you can also specify a partition filter using the table-replications[n].copier-option : circus-train-bigquery-partition-filter property. The partition filter is a BigQuery SQL query which will be used to filter the data being replicated - i.e. so one doesn't have to replicate the entire table on each run. You can think of the partition filter being executed as select * from bq_db.table where ${filter-condition}, with the ${filter-condition} being substituted by your partition filter. Only the rows returned by this query will be replicated. See partition filters in the main Circus Train documentation for more information.

Examples:

Simple Configuration

extension-packages: com.hotels.bdp.circustrain.bigquery
source-catalog:
  name: my-google-source-catalog
  hive-metastore-uris: bigquery://my-gcp-project-id
replica-catalog:
  name: my-replica-catalog
  hive-metastore-uris: thrift://internal-shared-hive-metastore-elb-123456789.us-west-2.elb.foobaz.com:9083

gcp-security:
  credential-provider: /home/hadoop/.gcp/my-gcp-project-01c26fd71db7.json 

table-replications:
- source-table:
    database-name: mysourcedb
    table-name: google_ads_data
  replica-table:
    database-name: myreplicadb
    table-name: bigquery_google_ads_data
    table-location: s3://mybucket/foo/baz/

Configuration with partition generation configured

extension-packages: com.hotels.bdp.circustrain.bigquery
source-catalog:
  ... see above ...

table-replications:
  -
    source-table:
      database-name: bdp
      table-name: bigquery_source
    replica-table:
      database-name: bdp
      table-name: hive_replica
      table-location: s3://bucket/foo/bar/
    copier-options:
      circustrain-bigquery-partition-by: date

Configuration with partition filter configured

extension-packages: com.hotels.bdp.circustrain.bigquery
source-catalog:
  ... see above ...

table-replications:
  -
    source-table:
      database-name: bdp
      table-name: bigquery_source
    replica-table:
      database-name: bdp
      table-name: hive_replica
      table-location: s3://bucket/foo/bar/
    copier-options:
      circustrain-bigquery-partition-by: date
      circustrain-bigquery-partition-filter: date BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -150 DAY) AND CURRENT_TIMESTAMP()

Technical Overview

The BigQuery plugin works by extracting the BigQuery table data into Google Storage using Google APIs, and then using Circus Train listeners to convert the BigQuery metadata into a Hive table object. The data is then replicated from source to replica using the metadata from this mocked Hive table.

Contact

Mailing List

If you would like to ask any questions about or discuss Circus Train BigQuery please join the main Circus Train mailing list at

https://groups.google.com/forum/#!forum/circus-train-user

Legal

This project is available under the Apache 2.0 License.

Copyright 2018-2020 Expedia, Inc.

circus-train-bigquery's People

Contributors

ananamj avatar cmathiesen avatar jaygreeeen avatar massdosage avatar s-hcom-ci avatar vedantchokshi avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.