bigquery-labs

GCP BigQuery labs using the Java SDK.

About

Made with:

  • Adoptium Temurin OpenJDK 11.0.17
  • Spring Boot v2.4.3
  • Apache Maven 3.8.7
  • IntelliJ IDEA 2023.1 (Ultimate Edition)
  • GoLand 2023.1 (Ultimate Edition)
  • PyCharm 2023.1 (Ultimate Edition)

Install & build

Download, install, and initialize the gcloud SDK on your local machine

Refer to the gcloud CLI documentation to complete this step.

Install the gcloud SDK to the user's home directory (e.g., /Users/USERNAME/google-cloud-sdk).

When it's finished installing, add the gcloud executable to your system's $PATH and run the command:

gcloud init
gcloud CLI: Generate an Application Default Credentials (ADC) access token

If you're running the application locally, you can use the following command to generate an access token:

export GCP_ADC_ACCESS_TOKEN="$(gcloud auth application-default print-access-token)"
gcloud CLI: Generate a GCP service account access token

Run this command to generate an access token for a specific GCP service account:

export GCP_SA_ACCESS_TOKEN=$(gcloud auth print-access-token --impersonate-service-account='GCP_SA_EMAIL_ADDRESS')

Replace the following:

  • GCP_SA_EMAIL_ADDRESS: the email address of the service account to impersonate.

Example:

export GCP_SA_ACCESS_TOKEN=$(gcloud auth print-access-token --impersonate-service-account='[email protected]')
Create and store a service account key

This section covers using a GCP service account key (.json) file stored on your local file system.

To map a local gcloud installation to a volume on a container instance running the application, include the -v parameter in the docker run command used to start a container instance, as described below.

macOS

Assuming the user's service account key file is stored in their local gcloud configuration directory:

/Users/USERNAME/.config/gcloud

export LOCAL_GCLOUD_AUTH_DIRECTORY=$HOME/.config/gcloud

and the target volume on the container instance is:

/root/.config/gcloud

export CONTAINER_GCLOUD_AUTH_DIRECTORY=/root/.config/gcloud

the command to run the container instance would be:

docker run --rm -it \
  -e GCP_SA_KEY_PATH=$GCP_SA_KEY_PATH \
  -e GCP_ADC_ACCESS_TOKEN=$GCP_ADC_ACCESS_TOKEN \
  -e GCP_SA_ACCESS_TOKEN=$GCP_SA_ACCESS_TOKEN \
  -e GCP_DEFAULT_USER_PROJECT_ID=$GCP_DEFAULT_USER_PROJECT_ID \
  -e GCP_DEFAULT_USER_DATASET=$GCP_DEFAULT_USER_DATASET \
  -e GCP_DEFAULT_USER_TABLE=$GCP_DEFAULT_USER_TABLE \
  -e GCP_SA_PROJECT_ID=$GCP_SA_PROJECT_ID \
  -e GCP_SA_DATASET=$GCP_SA_DATASET \
  -e GCP_SA_TABLE=$GCP_SA_TABLE \
  -v ${LOCAL_GCLOUD_AUTH_DIRECTORY}:${CONTAINER_GCLOUD_AUTH_DIRECTORY} \
  -v ${LOCAL_MAVEN_REPOSITORY}:${CONTAINER_MAVEN_REPOSITORY} \
  bigquery-labs

Replace the following in the path to the gcloud directory:

  • USERNAME: the current OS user's username

so that the path to the service account key file is correct, e.g.:

/Users/squidmin/.config/gcloud/sa-private-key.json

Read here for more information about creating service account keys.

Read here for more information about run config CLI arguments.

Add the Maven wrapper

Ensure that Maven is installed on the machine where you run this command.

In the root of this project, run the command:

mvn wrapper:wrapper

Read here for more information about installing Maven.

Build a JAR
./mvnw clean package -P integration \
  -DGCP_SA_KEY_PATH=$GCP_SA_KEY_PATH \
  -DGCP_ADC_ACCESS_TOKEN=$GCP_ADC_ACCESS_TOKEN \
  -DGCP_SA_ACCESS_TOKEN=$GCP_SA_ACCESS_TOKEN \
  -DGCP_DEFAULT_USER_PROJECT_ID=$GCP_DEFAULT_USER_PROJECT_ID \
  -DGCP_DEFAULT_USER_DATASET=$GCP_DEFAULT_USER_DATASET \
  -DGCP_DEFAULT_USER_TABLE=$GCP_DEFAULT_USER_TABLE \
  -DGCP_SA_PROJECT_ID=$GCP_SA_PROJECT_ID \
  -DGCP_SA_DATASET=$GCP_SA_DATASET \
  -DGCP_SA_TABLE=$GCP_SA_TABLE

Or use ./mvnw clean install:

./mvnw clean install -P integration \
  -DGCP_SA_KEY_PATH=$HOME/.config/gcloud/sa-private-key.json \
  -DGCP_ADC_ACCESS_TOKEN="$(gcloud auth application-default print-access-token)" \
  -DGCP_SA_ACCESS_TOKEN=placeholder \
  -DGCP_DEFAULT_USER_PROJECT_ID=lofty-root-378503 \
  -DGCP_DEFAULT_USER_DATASET=test_dataset_integration \
  -DGCP_DEFAULT_USER_TABLE=test_table_integration \
  -DGCP_SA_PROJECT_ID=placeholder \
  -DGCP_SA_DATASET=placeholder \
  -DGCP_SA_TABLE=placeholder

Read here for more information about run config CLI arguments.

Build a container image
docker build \
  --build-arg GCP_SA_KEY_PATH=${GCP_SA_KEY_PATH} \
  --build-arg GCP_DEFAULT_USER_PROJECT_ID=${GCP_DEFAULT_USER_PROJECT_ID} \
  --build-arg GCP_DEFAULT_USER_DATASET=${GCP_DEFAULT_USER_DATASET} \
  --build-arg GCP_DEFAULT_USER_TABLE=${GCP_DEFAULT_USER_TABLE} \
  --build-arg GCP_SA_PROJECT_ID=${GCP_SA_PROJECT_ID} \
  --build-arg GCP_SA_DATASET=${GCP_SA_DATASET} \
  --build-arg GCP_SA_TABLE=${GCP_SA_TABLE} \
  -t bigquery-labs .

Read here for more information about run config CLI arguments.


Run the application

1a. Prepare the host machine environment


Until the main application entrypoint is developed, run the application's functionality via the mvn test interface.

Pass required environment variables on your local system to the VM options, as shown in the listDatasets example below:

./mvnw \
  -Dtest=BigQueryAdminClientIntegrationTest#listDatasets \
  test -P integration \
  -DGCP_SA_KEY_PATH=$GCP_SA_KEY_PATH \
  -DGCP_ADC_ACCESS_TOKEN=$GCP_ADC_ACCESS_TOKEN \
  -DGCP_SA_ACCESS_TOKEN=$GCP_SA_ACCESS_TOKEN \
  -DGCP_DEFAULT_USER_PROJECT_ID=$GCP_DEFAULT_USER_PROJECT_ID \
  -DGCP_DEFAULT_USER_DATASET=$GCP_DEFAULT_USER_DATASET \
  -DGCP_DEFAULT_USER_TABLE=$GCP_DEFAULT_USER_TABLE

1b. Prepare the container environment


Environment variables

The GCP_SA_KEY_PATH environment variable is used to store the path to the user's GCP service account key file. The user's service account key file is mapped to the /root/.config/gcloud directory on the container instance.

export GCP_SA_KEY_PATH=/root/.config/gcloud/sa-private-key.json

The GCP_ADC_ACCESS_TOKEN environment variable is used to store an OAuth2 access token for reaching BigQuery RESTful services using Application Default Credentials (ADC).

export GCP_ADC_ACCESS_TOKEN=$(gcloud auth application-default print-access-token)

The GCP_SA_ACCESS_TOKEN environment variable is used to store an OAuth2 access token for reaching BigQuery RESTful services as a specific service account.

export GCP_SA_ACCESS_TOKEN=$(gcloud auth print-access-token --impersonate-service-account="[email protected]")
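To illustrate what "reaching BigQuery RESTful services" with one of these tokens looks like, the token is passed as a bearer token in the Authorization header. A minimal sketch follows; the datasets.list path is the real BigQuery v2 REST endpoint, but the project ID is a placeholder:

```shell
# Build the BigQuery v2 datasets.list URL for a project and show how an
# access token would be attached. Replace the placeholder project ID with
# a real one before making the request.
PROJECT_ID="placeholder-project"
URL="https://bigquery.googleapis.com/bigquery/v2/projects/${PROJECT_ID}/datasets"
echo "${URL}"
# With GCP_ADC_ACCESS_TOKEN (or GCP_SA_ACCESS_TOKEN) exported, the request is:
#   curl -s -H "Authorization: Bearer ${GCP_ADC_ACCESS_TOKEN}" "${URL}"
```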

Utility script (WIP)

The run.sh script at the root level of the project will set required environment variables automatically.

It accepts short and long arguments for each environment variable.
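As a rough sketch of how such a script might map its flags to environment variables (hypothetical implementation — the actual run.sh may differ; only the flag names shown in the examples below are assumed):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map run.sh's short/long flags to the environment
# variables the application expects. Flag names follow the usage examples
# in this README; the real run.sh may implement this differently.
parse_args() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -dpid|--GCP_DEFAULT_USER_PROJECT_ID) export GCP_DEFAULT_USER_PROJECT_ID="$2"; shift 2 ;;
      -sakp|--GCP_SA_KEY_PATH)             export GCP_SA_KEY_PATH="$2";             shift 2 ;;
      -saat|--GCP_SA_ACCESS_TOKEN)         export GCP_SA_ACCESS_TOKEN="$2";         shift 2 ;;
      -adcat|--GCP_ADC_ACCESS_TOKEN)       export GCP_ADC_ACCESS_TOKEN="$2";        shift 2 ;;
      *) echo "Unknown option: $1" >&2; return 1 ;;
    esac
  done
}

parse_args -dpid my-project -sakp "$HOME/.config/gcloud/sa-private-key.json"
echo "$GCP_DEFAULT_USER_PROJECT_ID"   # prints "my-project"
```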

2. Run an interactive container instance


docker run

docker run --rm -it \
  -e GCP_SA_KEY_PATH=$GCP_SA_KEY_PATH \
  -e GCP_ADC_ACCESS_TOKEN=$GCP_ADC_ACCESS_TOKEN \
  -e GCP_SA_ACCESS_TOKEN=$GCP_SA_ACCESS_TOKEN \
  -e GCP_DEFAULT_USER_PROJECT_ID=$GCP_DEFAULT_USER_PROJECT_ID \
  -e GCP_DEFAULT_USER_DATASET=$GCP_DEFAULT_USER_DATASET \
  -e GCP_DEFAULT_USER_TABLE=$GCP_DEFAULT_USER_TABLE \
  -e GCP_SA_PROJECT_ID=$GCP_SA_PROJECT_ID \
  -e GCP_SA_DATASET=$GCP_SA_DATASET \
  -e GCP_SA_TABLE=$GCP_SA_TABLE \
  -v ${LOCAL_GCLOUD_AUTH_DIRECTORY}:${CONTAINER_GCLOUD_AUTH_DIRECTORY} \
  -v ${LOCAL_MAVEN_REPOSITORY}:${CONTAINER_MAVEN_REPOSITORY} \
  bigquery-labs

Read here for more information about run config CLI arguments.


./run.sh utility script

./run.sh \
  -dpid $GCP_DEFAULT_USER_PROJECT_ID \
  -sakp $HOME/.config/gcloud/sa-private-key.json \
  -saat access_token_placeholder \
  -adcat $(gcloud auth application-default print-access-token)

Or use long arguments:

./run.sh \
  --GCP_DEFAULT_USER_PROJECT_ID $GCP_DEFAULT_USER_PROJECT_ID \
  --GCP_SA_KEY_PATH $HOME/.config/gcloud/sa-private-key.json \
  --GCP_SA_ACCESS_TOKEN access_token_placeholder \
  --GCP_ADC_ACCESS_TOKEN $(gcloud auth application-default print-access-token)

run.sh options


run.sh implements the following options:

  • --default: Start the application on the user's host system with default run environment settings.

    Example
    ./run.sh --default
  • -ci, --container-instance: Run a container instance pointing to the root directory of the application.

    Example
    ./run.sh --container-instance
  • -nci, --no-container-instance: Build and run the application without starting a container instance.

    Example
    ./run.sh -nci
  • -i GCP_SA_EMAIL_ADDRESS, --impersonate GCP_SA_EMAIL_ADDRESS: Impersonate a GCP service account.

    Replace GCP_SA_EMAIL_ADDRESS with the email address of the service account to impersonate.

    Example
    ./run.sh --default -ci --impersonate GCP_SA_EMAIL_ADDRESS

Other examples

Build JAR, build image, & start an interactive container instance with default run environment settings
./run.sh --default --container-instance
./run.sh --default -ci
Build and run JAR with default run environment settings
./run.sh --default
./run.sh --default --no-container-instance
./run.sh --default -nci

3. Run the JAR

Note: this section is currently used only for testing the main application entrypoint.

This project currently invokes the BigQuery Java SDK (and, later, the BigQuery RESTful services) via the mvn test interface. Use that interface for now.

Refer to the README.md in /src/test.

Using the "exec java" command (specify a profile):
exec java -jar \
  -Dspring.profiles.active=$PROFILE \
  -DGCP_SA_KEY_PATH=$GCP_SA_KEY_PATH \
  -DGCP_ADC_ACCESS_TOKEN=$GCP_ADC_ACCESS_TOKEN \
  -DGCP_SA_ACCESS_TOKEN=$GCP_SA_ACCESS_TOKEN \
  -DGCP_DEFAULT_USER_PROJECT_ID=$GCP_DEFAULT_USER_PROJECT_ID \
  -DGCP_DEFAULT_USER_DATASET="test_dataset_integration" \
  -DGCP_DEFAULT_USER_TABLE="test_table_integration_custom" \
  ./target/bigquery-labs-0.0.1-SNAPSHOT.jar
With Maven (specify a profile):
mvn spring-boot:run \
  -Dspring-boot.run.profiles=$PROFILE \
  -DGCP_SA_KEY_PATH=$GCP_SA_KEY_PATH \
  -DGCP_ADC_ACCESS_TOKEN=$GCP_ADC_ACCESS_TOKEN \
  -DGCP_SA_ACCESS_TOKEN=$GCP_SA_ACCESS_TOKEN \
  -DGCP_DEFAULT_USER_PROJECT_ID=$GCP_DEFAULT_USER_PROJECT_ID \
  -DGCP_DEFAULT_USER_DATASET="test_dataset_integration" \
  -DGCP_DEFAULT_USER_TABLE="test_table_integration_custom"

Table admin

bq CLI

List datasets
bq ls --filter labels.key:value \
  --max_results INTEGER \
  --format=prettyjson \
  --project_id PROJECT_ID

Replace the following:

  • key:value: a label key and value, if applicable.
  • INTEGER: an integer representing the number of datasets to list.
  • PROJECT_ID: the name of the GCP project containing the datasets to list.

Examples:

bq ls --format=pretty
Create a dataset

Refer to the GCP documentation for creating datasets.

Examples:

bq --location=us mk \
  --dataset \
  --default_partition_expiration=3600 \
  --default_table_expiration=3600 \
  --description="An example." \
  --label=test_label_1:test_value_1 \
  --label=test_label_2:test_value_2 \
  --max_time_travel_hours=168 \
  --storage_billing_model=LOGICAL \
  ${GCP_DEFAULT_USER_PROJECT_ID}:${GCP_DEFAULT_USER_DATASET}

An optional Cloud Key Management Service (KMS) key (KMS_KEY_NAME) can be specified with the --default_kms_key parameter. It sets the default Cloud KMS key used to protect newly created tables in the dataset. With this parameter set, you cannot create a Google-encrypted table in the dataset.

bq --location=us mk \
  --dataset \
  --default_kms_key=KMS_KEY_NAME \
  ... \
  ${GCP_DEFAULT_USER_PROJECT_ID}:${GCP_DEFAULT_USER_DATASET}
Delete a dataset

Refer to the GCP documentation for deleting a dataset.

Examples:

Remove all tables in the dataset (-r flag):

bq rm -r -f -d ${GCP_DEFAULT_USER_PROJECT_ID}:${GCP_DEFAULT_USER_DATASET}
Create a table with a configured schema

Create an empty table with an inline schema definition

bq mk --table PROJECT_ID:DATASET.TABLE SCHEMA

Replace the following:

  • PROJECT_ID: the name of the GCP project to target.
  • DATASET: the name of the BigQuery dataset to target.
  • TABLE: the name of the BigQuery table to target.
  • SCHEMA: an inline schema definition.

Example:

bq mk --table \
  ${GCP_DEFAULT_USER_PROJECT_ID}:${GCP_DEFAULT_USER_DATASET}.test_table_name_lofty \
  id:STRING,creation_timestamp:DATETIME,last_update_timestamp:DATETIME,column_a:STRING,column_b:BOOL

Specify the schema in a JSON schema file

For an example JSON schema file, refer to: /schema/example.json.
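The referenced file is not reproduced here, but a BigQuery JSON schema file for the inline schema used elsewhere in this README might look like the following (hypothetical contents; the actual /schema/example.json may differ):

```json
[
  { "name": "id", "type": "STRING", "mode": "REQUIRED" },
  { "name": "creation_timestamp", "type": "DATETIME", "mode": "NULLABLE" },
  { "name": "last_update_timestamp", "type": "DATETIME", "mode": "NULLABLE" },
  { "name": "column_a", "type": "STRING", "mode": "NULLABLE" },
  { "name": "column_b", "type": "BOOL", "mode": "NULLABLE" }
]
```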

Create an empty table

bq mk --table \
  PROJECT_ID:DATASET.TABLE \
  ./path/to/schema/file.json

Example:

bq mk --table \
  ${GCP_DEFAULT_USER_PROJECT_ID}:${GCP_DEFAULT_USER_DATASET}.test_table_name_lofty \
  ./schema/example.json

Create a table with CSV data

bq --location=LOCATION load \
  --source_format=FORMAT \
  PROJECT_ID:DATASET.TABLE \
  ./path/to/data/file.csv \
  ./path/to/schema/file.json

Example:

bq --location=us load \
  --source_format=CSV \
  ${GCP_DEFAULT_USER_PROJECT_ID}:${GCP_DEFAULT_USER_DATASET}.test_table_name_lofty \
  ./csv/example.csv \
  ./schema/example.json

Refer to the BigQuery documentation: Details of loading CSV data.
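For illustration, a ./csv/example.csv matching the inline schema shown earlier in this README might contain (hypothetical sample data; the actual file may differ):

```csv
id,creation_timestamp,last_update_timestamp,column_a,column_b
a1,2023-01-01 00:00:00,2023-01-02 00:00:00,example_value,true
a2,2023-01-03 00:00:00,2023-01-04 00:00:00,another_value,false
```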

Delete a table
bq rm --table ${GCP_DEFAULT_USER_PROJECT_ID}:${GCP_DEFAULT_USER_DATASET}.test_table_name_lofty
Show table schema

Example:

bq show \
  --schema \
  --format=prettyjson \
  ${GCP_DEFAULT_USER_PROJECT_ID}:${GCP_DEFAULT_USER_DATASET}.test_table_name_lofty

The table schema can be written to a file:

bq show \
  --schema \
  --format=prettyjson \
  ${GCP_DEFAULT_USER_PROJECT_ID}:${GCP_DEFAULT_USER_DATASET}.test_table_name_lofty > ./schema/example_show-write.json
Modify table schemas
bq update \
  ${GCP_DEFAULT_USER_PROJECT_ID}:${GCP_DEFAULT_USER_DATASET}.test_table_name_lofty \
  ./schema/example_update.json

Refer to the GCP documentation on modifying table schemas.

Insert data into a table

Examples:

Insert rows with known values:

bq insert ${GCP_DEFAULT_USER_PROJECT_ID}:${GCP_DEFAULT_USER_DATASET}.test_table_name_lofty ./json/example.json

Specify a template suffix (--template_suffix or -x):

bq insert --ignore_unknown_values \
  --template_suffix=_insert \
  ${GCP_DEFAULT_USER_PROJECT_ID}:${GCP_DEFAULT_USER_DATASET}.test_table_name_lofty \
  ./json/example.json

Refer to the bq insert documentation.

Run an interactive query
bq query \
  --use_legacy_sql=false \
  'query_string'

Example:

bq query \
  --use_legacy_sql=false \
  'SELECT
    id, column_b
  FROM
    `lofty-root-378503.test_dataset_name_lofty.test_table_name_lofty`
  LIMIT
    3;'
