This project is a part of the Data Engineering Zoomcamp. The main goal of the project is to seamlessly transfer data from the source into a data analytics dashboard using data engineering tools. The project uses the City of Austin Open Data portal to collect over 1.8 million service requests originating from the city and its metropolitan area. These requests date back to 01/03/2014 and are updated daily at 4:00 am. The dataset contains various service requests made by residents, covering issues ranging from park maintenance to broken traffic lights and beyond.
The project collects the city's service request data, then cleans, validates, and prepares it for data analysis. Airflow is used as the main orchestration tool to coordinate the workflow: data extraction, preprocessing, storage, and transformation.
Extracting:
- The data was retrieved through API requests.
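As a rough illustration (not the project's actual code), fetching the data page by page through the portal's SODA API could look like the sketch below; the dataset id is a placeholder you would need to look up on data.austintexas.gov.

```python
import requests

# Placeholder resource id -- replace with the real id of the Austin 311 dataset
DATASET_ID = "xxxx-xxxx"
URL = f"https://data.austintexas.gov/resource/{DATASET_ID}.json"

def fetch_page(offset: int, limit: int = 50_000) -> list[dict]:
    """Fetch one page of service requests through the SODA API."""
    params = {"$limit": limit, "$offset": offset, "$order": ":id"}
    response = requests.get(URL, params=params, timeout=60)
    response.raise_for_status()
    return response.json()

records = fetch_page(offset=0)
```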
Preprocessing:
- Correct data types were assigned to the collected data, dates were formatted, and non-meaningful columns and redundant information were dropped.
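A minimal sketch of this step with pandas is shown below; the column names are illustrative and may not match the dataset's real schema.

```python
import pandas as pd

def preprocess(records: list[dict]) -> pd.DataFrame:
    """Assign proper dtypes, format dates, and drop columns that carry no signal."""
    df = pd.DataFrame(records)

    # Illustrative columns -- adjust to the actual 311 schema
    df["created_date"] = pd.to_datetime(df["created_date"], errors="coerce")
    df["closed_date"] = pd.to_datetime(df["closed_date"], errors="coerce")
    df["council_district"] = pd.to_numeric(df["council_district"], errors="coerce")

    # Drop redundant / non-meaningful columns
    return df.drop(columns=["map_page", "map_tile"], errors="ignore")
```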
Storage:
- The preprocessed data is stored in a Google Cloud Storage bucket as a collection of multiple parquet files.
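For instance, a monthly chunk could be written straight to the bucket as parquet, assuming `gcsfs` is installed so pandas can write to `gs://` paths; the bucket name and path layout below are placeholders.

```python
import pandas as pd

def save_to_gcs(df: pd.DataFrame, bucket: str, year: int, month: int) -> None:
    """Write one monthly chunk of preprocessed data as a parquet file on GCS."""
    path = f"gs://{bucket}/raw/requests_{year}_{month:02d}.parquet"  # placeholder layout
    df.to_parquet(path, index=False)
```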
Transformation:
- With the help of PySpark, the data was cleaned and standardized to ensure its quality. For example, values like Austin, AUSTIN, AUS, and austin were replaced with a single standard value, Austin, and errors in city names and locations were corrected. The transformation step also includes feature engineering: new columns were created, for example by generalizing how a service request was received (down to phone, email, web, app, and other), calculating how long a case stayed open, and extracting values such as month and year.
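A condensed sketch of what `spark_job.py` might do for these two kinds of transformations; the column names and paths are placeholders, not necessarily the project's exact schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("apd311-transform").getOrCreate()
df = spark.read.parquet("gs://<your-bucket>/raw/*.parquet")

# Standardization: map Austin / AUSTIN / AUS / austin to a single value
df = df.withColumn(
    "location_city",
    F.when(F.upper(F.col("location_city")).isin("AUSTIN", "AUS"), "Austin")
     .otherwise(F.initcap(F.col("location_city"))),
)

# Feature engineering: how long the case stayed open, plus month and year
df = (
    df.withColumn("days_open", F.datediff("closed_date", "created_date"))
      .withColumn("year", F.year("created_date"))
      .withColumn("month", F.month("created_date"))
)

df.write.mode("overwrite").parquet("gs://<your-bucket>/transformed/")
```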
Load:
- The transformed data is saved to the GCS bucket and then loaded into BigQuery as an external table. With a SQL query I also created a table that is partitioned by date using monthly intervals and clustered by the method the service request was received. These transformations help optimize SQL query performance.
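The partitioned and clustered table can be created with SQL along these lines, run here through the BigQuery Python client; the project, dataset, bucket, and column names are placeholders, and `created_date` is assumed to be a DATE column (use `TIMESTAMP_TRUNC` for a TIMESTAMP column).

```python
from google.cloud import bigquery

client = bigquery.Client()

# External table on top of the transformed parquet files in GCS
client.query("""
    CREATE OR REPLACE EXTERNAL TABLE `your-project.apd311.stage_table`
    OPTIONS (format = 'PARQUET',
             uris = ['gs://your-bucket/transformed/*.parquet'])
""").result()

# Native table partitioned by month and clustered by the intake method
client.query("""
    CREATE OR REPLACE TABLE `your-project.apd311.main_table`
    PARTITION BY DATE_TRUNC(created_date, MONTH)
    CLUSTER BY method_received AS
    SELECT * FROM `your-project.apd311.stage_table`
""").result()
```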
- Infrastructure as Code: Terraform
- Data Lake: Google Cloud Storage
- Data Warehouse: BigQuery
- Orchestration: Airflow
- Data Transformations: Spark
- Serverless Cluster Service: Dataproc
- Containerization: Docker
- Data Visualisation: Looker
To reproduce this project you need a Google Cloud account (additional costs may apply), Docker and Docker Compose installed, and at least 5 GB of free disk space to load the Docker images and start the project. Please ensure that you have all the necessary installations and install any missing ones by following the instructions provided below.
Docker:
sudo apt-get update
sudo apt-get install docker.io
Docker Compose:
- Check if you have a `~/bin/` directory. If not, create one with the command `cd && mkdir bin`.
- Move into `bin/` with the command `cd bin`.
- Download the binary file from Docker's GitHub repository: `wget https://github.com/docker/compose/releases/download/v2.24.1/docker-compose-linux-x86_64`
- Rename the file to `docker-compose`.
- Make it executable: `chmod +x docker-compose`
- Open the `.bashrc` file by running the command `nano ~/.bashrc` and add the following line at the end of the file: `export PATH="${HOME}/bin:${PATH}"`
- Run `source ~/.bashrc`.
- Check that docker-compose works: `which docker-compose` should return the path to `~/bin/docker-compose`, and `docker-compose --version` should return the version you installed.
Terraform:
- Download a binary file from the Terraform site.
- Unzip it into the `~/bin` directory.
- Check the Terraform version by running the command `terraform --version`.
Google Cloud SDK:
The Google Cloud SDK will be installed automatically in the project's Docker image. If you'd like to install it locally as well, follow the official installation guide from [Google Cloud](https://cloud.google.com/sdk/docs/install). To use the Google Cloud SDK and Terraform, you need to grant access to your Google Cloud account. Follow the steps below to prepare your GCP account for the project.
1. On GCP, create a project.
2. Enable the APIs for your project:
   - Identity and Access Management (IAM) API
   - IAM Service Account Credentials API
   - Compute Engine API
   - Cloud Dataproc API
   - BigQuery API
   - BigQuery Storage API
3. Navigate to IAM & Admin -> Service Accounts.
4. Create a new service account for the project. Make sure to add the following roles to it:
   - BigQuery Admin
   - Compute Admin
   - Storage Admin
   - Storage Object Admin
   - Dataproc Administrator
   - Owner
5. Click on the name of the service account and move to Keys. Create a `*.json` key file, download it, and rename it to `apd311.json`. In the root directory create a folder `.gc` and move the key file into that folder:

mkdir ~/.gc
mv apd311.json ~/.gc/apd311.json

Alternatively, you can store it in your preferred location and replace the path to the file in the `docker-compose` and `.env` files.
The next two steps apply only to those who choose to install the Google Cloud SDK locally.

6. To add the environment variable with your credentials, run in the terminal:
export GOOGLE_APPLICATION_CREDENTIALS="${HOME}/.gc/apd311.json"
If you don't want to add the credentials path manually in each session, you can add the line above to the `~/.bashrc` file.
7. Authenticate the Google Cloud SDK:
gcloud auth application-default login
- Clone the project from GitHub (`https://github.com/nadia-paz/apd311.git`) or download it by clicking Code -> Download ZIP.
- In the terminal, move into the project's folder `apd311`.
Note!!! GCP bucket and project names are globally unique. You won't be able to create and/or use the same variables that I do. To reproduce the code, you'll need to replace the `GCP_PROJECT_ID` and `GCP_GCS_BUCKET` values with your own. Put your custom values in the `docker-compose.yaml` file located in the `airflow` directory (section `x-airflow-common:` -> `environment`). If you stored or named the secret key differently from `~/.gc/apd311.json`, you need to add your custom value in the `volumes` section as well (code lines 11, 12, 14, 15, 30). Move to the next step only after you have updated this information.
- Move to the `terraform` directory. Run the commands `terraform init` and `terraform apply`.
- Move back to the `airflow` directory: `cd .. && cd airflow`
- In addition to the `dags` directory, you'll need to create the `plugins` and `logs` directories:
mkdir plugins logs
Note for Linux users!!! You might need a small workaround before starting Docker. Run in the terminal:
echo $(id -u)
In the `.env` file, replace the default numeric value of `AIRFLOW_UID` with the one you got in the terminal.
- Now we are ready to start Docker. First, we need to rebuild the official Airflow image to fit our needs; then we can start Airflow. In the `airflow` directory run the following commands step by step:
sudo docker build . -t airflow-plus:latest
docker-compose up airflow-init
docker-compose up -d
With the command `docker ps`, check that all containers started. You should see:
- airflow-webserver
- airflow-scheduler
- airflow-init
- postgres
- If everything runs OK, we can log in to the Airflow web UI. In the browser go to `http://localhost:8080` and enter the login `airflow` and password `airflow`. On the home page you will see 3 DAGs (Directed Acyclic Graphs) that organize the tasks. Activate all DAGs by clicking the toggle button next to each one.
All DAGs are scheduled to run. Alternatively, you can trigger them manually in the order: `upload_spark_file` >> `pipeline` >> `create_tables`
upload_spark_file
- Runs -> once on April 11, 2024, not scheduled.
- Contains one task, `spark_job_file_task`, that uploads the file `spark_job.py` to Google Cloud Storage.
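A minimal sketch of how such a one-off DAG could be wired, assuming the `LocalFilesystemToGCSOperator` from the Google provider package; the file path, object name, and bucket are placeholders rather than the project's actual values.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator

with DAG(
    dag_id="upload_spark_file",
    start_date=datetime(2024, 4, 11),
    schedule_interval="@once",
    catchup=False,
) as dag:
    spark_job_file_task = LocalFilesystemToGCSOperator(
        task_id="spark_job_file_task",
        src="/opt/airflow/dags/spark_job.py",  # placeholder path inside the container
        dst="code/spark_job.py",               # placeholder object name in the bucket
        bucket="<your-bucket>",
    )
```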
pipeline
- Runs -> scheduled to run every week on Sunday at midnight GMT.
- Start date: April 12, 2024.
- End date: June 30, 2024.
- Tasks:
  - `save_data_task` -> extracts data from the City of Austin Open Data Portal and loads it onto the GCS bucket.
  - `create_cluster_task` -> creates a Dataproc cluster on Google Cloud Platform.
  - `submit_spark_job_task` -> submits the Spark job to the Dataproc cluster.
  - `delete_cluster_task` -> deletes the Dataproc cluster.
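A rough sketch of the Dataproc portion of this DAG (the `save_data_task` extraction step is omitted), assuming the Dataproc operators from the Google provider package; every project id, region, cluster name, and path below is a placeholder.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "<your-project-id>"
REGION = "us-central1"
CLUSTER_NAME = "apd311-cluster"

with DAG(
    dag_id="pipeline",
    start_date=datetime(2024, 4, 12),
    schedule_interval="0 0 * * 0",  # Sundays at midnight GMT
    catchup=False,
) as dag:
    create_cluster_task = DataprocCreateClusterOperator(
        task_id="create_cluster_task",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    )

    submit_spark_job_task = DataprocSubmitJobOperator(
        task_id="submit_spark_job_task",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {"main_python_file_uri": "gs://<your-bucket>/code/spark_job.py"},
        },
    )

    delete_cluster_task = DataprocDeleteClusterOperator(
        task_id="delete_cluster_task",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule="all_done",  # clean up the cluster even if the Spark job fails
    )

    create_cluster_task >> submit_spark_job_task >> delete_cluster_task
```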
create_tables
- Runs -> scheduled to run every week on Sunday at 4:00 AM GMT.
- Start date: April 12, 2024.
- End date: June 30, 2024.
- Tasks:
  - for the external table: `create_stage_dataset_task`, `create_stage_table_task`
  - for the partitioned table: `create_main_dataset_task`, `create_main_table_task`
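As an illustration, the stage-table task could be expressed with `BigQueryInsertJobOperator` roughly as below; the project, dataset, and bucket names are placeholders, and the project's real SQL may differ.

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Defined inside the create_tables DAG context
create_stage_table_task = BigQueryInsertJobOperator(
    task_id="create_stage_table_task",
    configuration={
        "query": {
            "query": """
                CREATE OR REPLACE EXTERNAL TABLE `<your-project-id>.apd311.stage_table`
                OPTIONS (format = 'PARQUET',
                         uris = ['gs://<your-bucket>/transformed/*.parquet'])
            """,
            "useLegacySql": False,
        }
    },
)
```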
After the tasks finish their runs, you can move to BigQuery to work with the data. Look for the dataset `apd311` and the table `main_table`.
To create your own report with Looker Studio, click Create, pick Data Source -> BigQuery -> Project Name -> Dataset `apd311` -> Table `main_table`, and press Create report. Make sure that you change all location fields to the Geo data type: press the `ABC` icon next to the `location_sity`, `location_county`, and `location_zip` fields and the `123` icon next to `location_lat` and `location_long`, then pick `GEO`. Create a dashboard.
- In the terminal, move to the `terraform` directory and run the command `terraform destroy`.
- Go to Google Cloud Storage and manually delete the buckets created by Dataproc.
- Move to the `airflow` directory and shut down Docker by running the command:
docker-compose down --rmi "all" --volumes
`--volumes` and `--rmi` are optional flags that remove all volumes (`--volumes`) and images (`--rmi`) from your computer.
- Delete the project's directory.