This project is a part of the Data Engineering Zoomcamp. The main goal of the project is to seamlessly transfer data from the source into a data analytics dashboard using data engineering tools. The project uses the City of Austin Open Data portal to collect over 1.8 million service requests originating from the city and its metropolitan area. These requests date back to 01/03/2014 and are updated daily at 4:00 am. The dataset contains various service requests made by residents, covering issues ranging from park maintenance to broken traffic lights and beyond.
The project collects the city's service request data, then cleans, validates, and prepares it for data analysis. Airflow is used as the main orchestration tool to coordinate the workflow: data extraction, preprocessing, storage, and transformation.
Extracting:
- The data was retrieved through API requests.
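As a rough illustration (not the project's actual code), fetching the data page by page through the portal's SODA API could look like the sketch below; the dataset id is a placeholder you would need to look up on data.austintexas.gov.

```python
import requests

# Placeholder resource id -- replace with the real id of the Austin 311 dataset
DATASET_ID = "xxxx-xxxx"
URL = f"https://data.austintexas.gov/resource/{DATASET_ID}.json"

def fetch_page(offset: int, limit: int = 50_000) -> list[dict]:
    """Fetch one page of service requests through the SODA API."""
    params = {"$limit": limit, "$offset": offset, "$order": ":id"}
    response = requests.get(URL, params=params, timeout=60)
    response.raise_for_status()
    return response.json()

records = fetch_page(offset=0)
```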
Preprocessing:
- Correct data types were assigned to the collected data, dates were formatted, and non-meaningful columns and redundant information were dropped.
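A minimal sketch of this step with pandas is shown below; the column names are illustrative and may not match the dataset's real schema.

```python
import pandas as pd

def preprocess(records: list[dict]) -> pd.DataFrame:
    """Assign proper dtypes, format dates, and drop columns that carry no signal."""
    df = pd.DataFrame(records)

    # Illustrative columns -- adjust to the actual 311 schema
    df["created_date"] = pd.to_datetime(df["created_date"], errors="coerce")
    df["closed_date"] = pd.to_datetime(df["closed_date"], errors="coerce")
    df["council_district"] = pd.to_numeric(df["council_district"], errors="coerce")

    # Drop redundant / non-meaningful columns
    return df.drop(columns=["map_page", "map_tile"], errors="ignore")
```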
Storage:
- The preprocessed data is stored in a Google Cloud Storage bucket as a collection of multiple parquet files.
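For instance, a monthly chunk could be written straight to the bucket as parquet, assuming `gcsfs` is installed so pandas can write to `gs://` paths; the bucket name and path layout below are placeholders.

```python
import pandas as pd

def save_to_gcs(df: pd.DataFrame, bucket: str, year: int, month: int) -> None:
    """Write one monthly chunk of preprocessed data as a parquet file on GCS."""
    path = f"gs://{bucket}/raw/requests_{year}_{month:02d}.parquet"  # placeholder layout
    df.to_parquet(path, index=False)
```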
Transformation:
- With the help of PySpark, the data was cleaned and standardized to ensure its quality. For example, values like Austin, AUSTIN, AUS, and austin were replaced with a single standard value, Austin, and errors in city names and locations were corrected. The transformation step also includes feature engineering: new columns were created, for example by generalizing how a service request was received (down to phone, email, web, app, and other), calculating how long a case stayed open, and extracting values such as month and year.
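A condensed sketch of what `spark_job.py` might do for these two kinds of transformations; the column names and paths are placeholders, not necessarily the project's exact schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("apd311-transform").getOrCreate()
df = spark.read.parquet("gs://<your-bucket>/raw/*.parquet")

# Standardization: map Austin / AUSTIN / AUS / austin to a single value
df = df.withColumn(
    "location_city",
    F.when(F.upper(F.col("location_city")).isin("AUSTIN", "AUS"), "Austin")
     .otherwise(F.initcap(F.col("location_city"))),
)

# Feature engineering: how long the case stayed open, plus month and year
df = (
    df.withColumn("days_open", F.datediff("closed_date", "created_date"))
      .withColumn("year", F.year("created_date"))
      .withColumn("month", F.month("created_date"))
)

df.write.mode("overwrite").parquet("gs://<your-bucket>/transformed/")
```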
Load:
- The transformed data is saved to the GCS bucket and then loaded into BigQuery as an external table. With a SQL query I also created a table that is partitioned by date using monthly intervals and clustered by the method the service request was received. These transformations help optimize SQL query performance.
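The partitioned and clustered table can be created with SQL along these lines, run here through the BigQuery Python client; the project, dataset, bucket, and column names are placeholders, and `created_date` is assumed to be a DATE column (use `TIMESTAMP_TRUNC` for a TIMESTAMP column).

```python
from google.cloud import bigquery

client = bigquery.Client()

# External table on top of the transformed parquet files in GCS
client.query("""
    CREATE OR REPLACE EXTERNAL TABLE `your-project.apd311.stage_table`
    OPTIONS (format = 'PARQUET',
             uris = ['gs://your-bucket/transformed/*.parquet'])
""").result()

# Native table partitioned by month and clustered by the intake method
client.query("""
    CREATE OR REPLACE TABLE `your-project.apd311.main_table`
    PARTITION BY DATE_TRUNC(created_date, MONTH)
    CLUSTER BY method_received AS
    SELECT * FROM `your-project.apd311.stage_table`
""").result()
```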
- Infrastructure as Code: Terraform
- Data Lake: Google Cloud Storage
- Data Warehouse: BigQuery
- Orchestration: Airflow
- Data Transformations: Spark
- Serverless Cluster Service: Dataproc
- Containerization: Docker
- Data Visualisation: Looker
To reproduce this project you need a Google Cloud account (additional costs may apply), Docker and Docker Compose installed, and at least 5 GB of free disk space to load the Docker images and start the project. Please ensure that you have all the necessary installations and install any missing ones by following the instructions provided below.
Docker:
sudo apt-get update
sudo apt-get install docker.io
Docker Compose:
- Check if you have a `~/bin/` directory. If not, create one with the command `cd && mkdir bin`.
- Move into `bin/` with the command `cd bin`.
- Download the binary file from Docker's GitHub repository: `wget https://github.com/docker/compose/releases/download/v2.24.1/docker-compose-linux-x86_64`
- Rename the file to `docker-compose`.
- Make it executable: `chmod +x docker-compose`
- Open the `.bashrc` file by running the command `nano ~/.bashrc` and add the following line at the end of the file: `export PATH="${HOME}/bin:${PATH}"`
- Run `source ~/.bashrc`.
- Check that docker-compose works: `which docker-compose` should return the path to `~/bin/docker-compose`, and `docker-compose --version` should return the version you installed.
Terraform:
- Download a binary file from the Terraform site.
- Unzip it into the `~/bin` directory.
- Check the Terraform version by running the command `terraform --version`.
Google Cloud SDK:
The Google Cloud SDK will be installed automatically in the project's Docker image. If you'd like to install it locally as well, follow the official installation guide from [Google Cloud](https://cloud.google.com/sdk/docs/install). To use the Google Cloud SDK and Terraform, you need to grant access to your Google Cloud account. Follow the steps below to prepare your GCP account for the project.
1. On GCP, create a project.
2. Enable the APIs for your project:
   - Identity and Access Management (IAM) API
   - IAM Service Account Credentials API
   - Compute Engine API
   - Cloud Dataproc API
   - BigQuery API
   - BigQuery Storage API
3. Navigate to IAM & Admin -> Service Accounts.
4. Create a new service account for the project. Make sure to add the following roles to it:
   - BigQuery Admin
   - Compute Admin
   - Storage Admin
   - Storage Object Admin
   - Dataproc Administrator
   - Owner
5. Click on the name of the service account and move to Keys. Create a `*.json` key file, download it, and rename it to `apd311.json`. In the root directory create a folder `.gc` and move the key file into that folder:

mkdir ~/.gc
mv apd311.json ~/.gc/apd311.json

Alternatively, you can store it in your preferred location and replace the path to the file in the `docker-compose` and `.env` files.
The next two steps apply only to those who choose to install the Google Cloud SDK locally.

6. To add the environment variable with your credentials, run in the terminal:
export GOOGLE_APPLICATION_CREDENTIALS="${HOME}/.gc/apd311.json"
If you don't want to add the credentials path manually in each session, you can add the line above to the `~/.bashrc` file.
7. Authenticate the Google Cloud SDK:
gcloud auth application-default login
- Clone the project from GitHub (`https://github.com/nadia-paz/apd311.git`) or download it by clicking Code -> Download ZIP.
- In the terminal, move into the project's folder `apd311`.
Note!!! GCP bucket and project names are globally unique. You won't be able to create and/or use the same variables that I do. To reproduce the code, you'll need to replace the `GCP_PROJECT_ID` and `GCP_GCS_BUCKET` values with your own. Put your custom values in the `docker-compose.yaml` file located in the `airflow` directory (section `x-airflow-common:` -> `environment`). If you stored or named the secret key differently from `~/.gc/apd311.json`, you need to add your custom value in the `volumes` section as well (code lines 11, 12, 14, 15, 30). Move to the next step only after you have updated this information.
- Move to the `terraform` directory. Run the commands `terraform init` and `terraform apply`.
- Move back to the `airflow` directory: `cd .. && cd airflow`
- In addition to the `dags` directory, you'll need to create the `plugins` and `logs` directories:
mkdir plugins logs
Note for Linux users!!! You might need a small workaround before starting Docker. Run in the terminal:
echo $(id -u)
In the `.env` file, replace the default numeric value of `AIRFLOW_UID` with the one you got in the terminal.
- Now we are ready to start Docker. First, we need to rebuild the official Airflow image to fit our needs; then we can start Airflow. In the `airflow` directory run the following commands step by step:
sudo docker build . -t airflow-plus:latest
docker-compose up airflow-init
docker-compose up -d
With the command `docker ps`, check that all containers started. You should see:
- airflow-webserver
- airflow-scheduler
- airflow-init
- postgres
- If everything runs OK, we can log in to the Airflow web UI. In the browser go to `http://localhost:8080` and enter the login `airflow` and password `airflow`. On the home page you will see 3 DAGs (Directed Acyclic Graphs) that organize the tasks. Activate all DAGs by clicking the toggle button next to each one.
All DAGs are scheduled to run. Alternatively, you can trigger them manually in the order: `upload_spark_file` >> `pipeline` >> `create_tables`
upload_spark_file
- Runs -> once on April 11, 2024, not scheduled.
- Contains one task, `spark_job_file_task`, that uploads the file `spark_job.py` to Google Cloud Storage.
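A minimal sketch of how such a one-off DAG could be wired, assuming the `LocalFilesystemToGCSOperator` from the Google provider package; the file path, object name, and bucket are placeholders rather than the project's actual values.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator

with DAG(
    dag_id="upload_spark_file",
    start_date=datetime(2024, 4, 11),
    schedule_interval="@once",
    catchup=False,
) as dag:
    spark_job_file_task = LocalFilesystemToGCSOperator(
        task_id="spark_job_file_task",
        src="/opt/airflow/dags/spark_job.py",  # placeholder path inside the container
        dst="code/spark_job.py",               # placeholder object name in the bucket
        bucket="<your-bucket>",
    )
```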
pipeline
- Runs -> scheduled to run every week on Sunday at midnight GMT.
- Start date: April 12, 2024.
- End date: June 30, 2024.
- Tasks:
  - `save_data_task` -> extracts data from the City of Austin Open Data Portal and loads it onto the GCS bucket.
  - `create_cluster_task` -> creates a Dataproc cluster on Google Cloud Platform.
  - `submit_spark_job_task` -> submits the Spark job to the Dataproc cluster.
  - `delete_cluster_task` -> deletes the Dataproc cluster.
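A rough sketch of the Dataproc portion of this DAG (the `save_data_task` extraction step is omitted), assuming the Dataproc operators from the Google provider package; every project id, region, cluster name, and path below is a placeholder.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "<your-project-id>"
REGION = "us-central1"
CLUSTER_NAME = "apd311-cluster"

with DAG(
    dag_id="pipeline",
    start_date=datetime(2024, 4, 12),
    schedule_interval="0 0 * * 0",  # Sundays at midnight GMT
    catchup=False,
) as dag:
    create_cluster_task = DataprocCreateClusterOperator(
        task_id="create_cluster_task",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    )

    submit_spark_job_task = DataprocSubmitJobOperator(
        task_id="submit_spark_job_task",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {"main_python_file_uri": "gs://<your-bucket>/code/spark_job.py"},
        },
    )

    delete_cluster_task = DataprocDeleteClusterOperator(
        task_id="delete_cluster_task",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule="all_done",  # clean up the cluster even if the Spark job fails
    )

    create_cluster_task >> submit_spark_job_task >> delete_cluster_task
```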
create_tables
- Runs -> scheduled to run every week on Sunday at 4:00 AM GMT.
- Start date: April 12, 2024.
- End date: June 30, 2024.
- Tasks:
  - for the external table: `create_stage_dataset_task`, `create_stage_table_task`
  - for the partitioned table: `create_main_dataset_task`, `create_main_table_task`
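As an illustration, the stage-table task could be expressed with `BigQueryInsertJobOperator` roughly as below; the project, dataset, and bucket names are placeholders, and the project's real SQL may differ.

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Defined inside the create_tables DAG context
create_stage_table_task = BigQueryInsertJobOperator(
    task_id="create_stage_table_task",
    configuration={
        "query": {
            "query": """
                CREATE OR REPLACE EXTERNAL TABLE `<your-project-id>.apd311.stage_table`
                OPTIONS (format = 'PARQUET',
                         uris = ['gs://<your-bucket>/transformed/*.parquet'])
            """,
            "useLegacySql": False,
        }
    },
)
```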
After the tasks finish their runs, you can move to BigQuery to work with the data. Look for the dataset `apd311` and the table `main_table`.
To create your own report with Looker Studio, click Create, pick Data Source -> BigQuery -> Project Name -> Dataset `apd311` -> Table `main_table`, and press Create report. Make sure that you change all location fields to the Geo data type: press the `ABC` icon next to the `location_sity`, `location_county`, and `location_zip` fields and the `123` icon next to `location_lat` and `location_long`, then pick `GEO`. Create a dashboard.
- In the terminal, move to the `terraform` directory and run the command `terraform destroy`.
- Go to Google Cloud Storage and manually delete the buckets created by Dataproc.
- Move to the `airflow` directory and shut down Docker by running the command:
docker-compose down --rmi "all" --volumes
`--volumes` and `--rmi` are optional flags that remove all volumes (`--volumes`) and images (`--rmi`) from your computer.
- Delete the project's directory.