zero2hero's Issues

https://www.startdataengineering.com/post/how-to-schedule-a-sql-script-using-apache-airflow-with-an-example/

Scheduling a SQL script, using Apache Airflow, with an example
Mar 29, 2020 · 3 min read

One of the most common use cases for Apache Airflow is to run scheduled SQL scripts. Developers who start with Airflow often ask the following questions

“How to use airflow to orchestrate sql?”

“How to specify date filters based on schedule intervals in Airflow?”

This post aims to cover those questions. It assumes you have a basic understanding of Apache Airflow and SQL. Let's see how we can schedule a SQL script using Airflow, with an example. Assume we want to run a SQL script every day at midnight. This SQL script aggregates the previous day's data from the event table and stores the result in an event_stats table.

Sample ETL

We can use Airflow to run the SQL script every day. In this example we use MySQL, but Airflow provides operators to connect to most databases.

DAG: Directed Acyclic Graph. In Airflow this is used to denote a data pipeline that runs on a schedule interval. A DAG can be made up of one or more individual tasks.

from airflow import DAG
from airflow.operators.mysql_operator import MySqlOperator

default_arg = {'owner': 'airflow', 'start_date': '2020-02-28'}

dag = DAG('simple-mysql-dag',
          default_args=default_arg,
          schedule_interval='0 0 * * *')

mysql_task = MySqlOperator(dag=dag,
                           mysql_conn_id='mysql_default',
                           task_id='mysql_task',
                           sql='/sample_sql.sql',
                           params={'test_user_id': -99})

mysql_task
In the above script

0 0 * * * is a cron schedule expression, denoting that the DAG should run every day at midnight, i.e. the 0th hour of every day (note that Airflow runs on UTC time by default).
mysql_conn_id is the connection id for your SQL database. You can set it in Admin -> Connections from the Airflow UI; there you set the username and password that Airflow uses to access your database.
The SQL script to perform this operation is stored in a separate file sample_sql.sql. This file is read in when the DAG is being run.
USE your_database;

DROP TABLE IF EXISTS event_stats_staging;
CREATE TABLE event_stats_staging AS
SELECT date
     , user_id
     , sum(spend_amt) total_spend_amt
FROM event
WHERE date = '{{ ds }}'
AND user_id <> {{ params.test_user_id }}
GROUP BY date, user_id;

INSERT INTO event_stats (
date
, user_id
, total_spend_amt
)
SELECT date
, user_id
, total_spend_amt
FROM event_stats_staging;

DROP TABLE event_stats_staging;
The values within {{ }} are called templated parameters. Airflow replaces them with a variable that is passed in through the DAG script at run-time or made available via Airflow metadata macros. This may seem like overkill for our use case. But it becomes very helpful when we have more complex logic and want to dynamically generate parts of the script, such as where clauses, at run time.

There are 2 key concepts in the templated SQL script shown above

Airflow macros and template variables: they provide access to the metadata available for each DAG run. We use the execution date (the ds template variable), as it is the previous date over which we want to aggregate the data. ref: https://airflow.apache.org/docs/stable/macros.html

Templated parameters: If we want our SQL script to have some parameters that can be filled at run time from the DAG, we can pass them as parameters to the task. In our example we filter out a specific user_id (-99) before aggregation.
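To make the templating concrete, here is roughly what the staging statement renders to at run time, assuming a DAG run whose execution date is 2020-02-28 and the params set in the DAG definition above (the exact rendered text depends on your Airflow version):

DROP TABLE IF EXISTS event_stats_staging;
CREATE TABLE event_stats_staging AS
SELECT date
     , user_id
     , sum(spend_amt) total_spend_amt
FROM event
WHERE date = '2020-02-28'        -- '{{ ds }}' rendered to the execution date
AND user_id <> -99               -- {{ params.test_user_id }} rendered from params
GROUP BY date, user_id;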

In this post we saw

How to schedule SQL scripts using Apache Airflow
How Airflow connects to the database using the connection id
How to pass in parameters at run-time using input parameters and macros
Hope this gives you an understanding of how to schedule SQL scripts in Airflow and how to use templating. If you have any questions or comments, please let me know below.


https://www.startdataengineering.com/post/what-does-it-mean-for-a-column-to-be-indexed/

What Does It Mean for a Column to Be Indexed
May 2, 2020 · 10 min read

When optimizing queries on a database table, most developers tend to just create an index on the field to be queried. They have questions like

I don’t really understand what it means for a column to be “indexed”

in addition to simply boosting the efficiency of a query, are there any other uses of indexing or things to be aware of?

In this post

We cover the most widely used index structure in databases, the B-tree
We go over scan types, which determine how the database engine reads data from disk
We run experiments to understand how an index affects CRUD operations for different types of fields and queries
We go over the trade-offs that are made when we use an index
B-tree
Invented in the 1970s, B-trees are the most popular data structure used as the default indexing mechanism in most databases. Although there are many types of indexing data structures, we will only look at B-trees here, as they are the most popular and widely used. To understand the B-tree, let's consider a simple dataset, users, with the following fields

user_id: a unique identifier for the user (assume user ids start from 1 - 1000)
user_name: the name of the user
user_account_type: the account type of the user (assume it can only be one of C or S or I)
The database stores data on disk in small chunks called blocks or pages, traditionally 4KB in size; this size is an optimization based on how the hardware reads data. When we create an index on the user_id column, a separate B-tree is created with roughly the following shape. The B-tree starts at the root page and holds references to other pages. The pages that point to the actual data are called leaf pages.
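In case you are curious, you can ask Postgres (which we use for the experiments later in this post) what page size it was built with; the default is 8KB:

SHOW block_size;  -- returns the page size in bytes, 8192 by default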

[Figure: B-tree index structure]

As you can see from the diagram, the B-tree has many pages, each path leading to a leaf page from which we can get a pointer to the data location (the exact implementation depends on the database being used). The number of references to child pages from a parent page is called the branching factor; in our example it is 5.

The advantage of a B-tree is that data access has a Big-O runtime of O(log(n)), which is very efficient. (ref: Big-O)
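As a rough worked example, with the 1,000 user_ids and the branching factor of 5 assumed above, a lookup only has to walk about ceil(log_5(1000)) = ceil(4.29) = 5 pages from the root down to a leaf, instead of scanning all 1,000 entries.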

Then why not create indexes for all columns in the database? Well, as with most things in software engineering, there is a tradeoff. In the case of indexes, it is the extra work the database engine has to do on every insert/update/delete to keep the B-tree balanced. We will see the pros and cons of indexing in the last section.

Scan types
All database operations are one of the Create, Read, Update, Delete (CRUD) operations. These operations require the database engine to read or write data on disk, and accessing one or many blocks/pages requires scanning the data to find the exact block(s). The scan is the operation we aim to optimize with an index. We will see 4 main types of scan

Seq Scan: The database engine scans through all the data sequentially

Parallel Seq Scan: The database scans through all the data in parallel with n workers (the number of workers is shown in the explain output). This is not always used, because the overhead of starting, managing and reading data from multiple workers is significant; it is used to improve the performance of queries that scan a large amount of data but return only a few rows that satisfy the selection criteria.

Index Scan: If your query has a filter on an indexed column, the database engine uses the B-tree index to get the data location and reads just that page (this is the fastest). Depending on the database, if the engine estimates your query will return more than roughly 5-10% of the entire table, the Index Scan is skipped in favor of a Seq Scan: the engine decides that the overhead of looking up the page location in the B-tree and then reading the page, repeated for many values of the indexed column, is not worth it compared to a simple sequential scan of all the data.

Bitmap Index Scan: A combination of Index Scan and Seq Scan, used when the number of rows to be selected is too large for an Index Scan but too low for a Seq Scan.

Experiments
In this section we go over some common scenarios, see how the queries perform, and look at cases where a specific type of scan is preferred by the database engine. The idea is to get a sense of why a certain query may be faster. At the end of the post we will see the pros and cons of indexing.

Set up
In order to run the experiments you will need

docker (preferred) to run postgres
pgcli to connect to our postgres instance
Data Download here
Download the data to a folder on your local machine, then set up docker with your data folder as a volume mount:

docker run --name pg_local -p 5432:5432 \
-e POSTGRES_USER=start_data_engineer \
-e POSTGRES_PASSWORD=password -e POSTGRES_DB=tutorial \
-v <path-to-your-data-folder>:/data -d postgres:12.2
Start pgcli

pgcli -h localhost -p 5432 -U start_data_engineer tutorial
Type in the password when pgcli prompts you, and create the user table as shown below

CREATE SCHEMA bank;
SET search_path TO bank,public;
CREATE TABLE bank.user (
user_id int,
user_name varchar(50),
score int,
user_account_type char(1)
);
COPY bank.user FROM '/data/user.csv' DELIMITER ',' CSV HEADER;
Explain
We can use the explain command to see how the database engine is going to execute our query. Its output is meant to be read bottom up. (ref: EXPLAIN) The command also gives you a cost metric, which is what the database engine aims to minimize.
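Note that plain EXPLAIN only shows the planner's estimates and does not run the query. If you also want actual timings and row counts, Postgres supports EXPLAIN ANALYZE, which really executes the statement (so be careful using it on queries that modify data):

-- Estimated plan only; the query is not executed
EXPLAIN SELECT * FROM bank.user WHERE score = 900001;
-- Executes the query and reports actual timings and row counts alongside the estimates
EXPLAIN ANALYZE SELECT * FROM bank.user WHERE score = 900001;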

Experiment 1 (E1)
A select query with equality filter on score, without an index on score

explain select * from bank.user where score = 900001;
[Figure: Experiment 1 query plan]

Scan Type: Parallel Seq Scan, because a scan needs to be done on the entire dataset (since we have no index) and the result will be very few (if any) rows from the table.
Execution time: 0.055s
Experiment 2 (E2)
A select query with range filter on score, without an index on score

explain select * from bank.user where score > 900000;
[Figure: Experiment 2 query plan]

Scan Type: Seq Scan, because a scan needs to be done on the entire dataset (since we have no index).
Execution time: 0.354s
Experiment 3 (E3)
An update query on score with filter on score, without an index on score

explain update bank.user set score = 1000000 where score > 1000;
[Figure: Experiment 3 query plan]

Scan Type: Seq Scan, because the entire dataset needs to be scanned (since we have no index) to filter for rows with score > 1000, and then the records that satisfy the filter are updated.
Execution time: 2.447s
Experiment 4 (E4)
A select query with equality filter on user_account_type, without an index on user_account_type

explain select * from bank.user where user_account_type = 'S';
[Figure: Experiment 4 query plan]

Scan Type: Seq Scan, because the entire dataset needs to be scanned (since we have no index) to filter for rows with user_account_type = 'S' and return the records that satisfy the filter.
Execution time: 0.830s
Experiment 5 (E5)
A select query with filter on score, with an index on score

CREATE INDEX score_idx ON bank.user (score);
-- Time 0.709s
explain select * from bank.user where score = 900001;
[Figure: Experiment 5 query plan]

Scan Type: Index Scan, which means the engine uses the B-tree index to get a reference to the page where the required data lives.
Execution time: 0.019s
Experiment 6 (E6)
A select query with range filter on score, with an index on score

explain select * from bank.user where score > 900000;
[Figure: Experiment 6 query plan]

Scan Type: Bitmap Index Scan, which means the number of records that satisfy the condition is estimated to be too large to benefit from an Index Scan but too small to warrant a full Seq Scan.
Execution time: 0.411s
explain select * from bank.user where score > 600000;
[Figure: Experiment 6.2 query plan]

Scan Type: Seq Scan is used even though we have an index on the score column, because the database engine estimates that the cost of checking the index and then fetching the data page, repeated for every row with score > 600000, is higher than a simple Seq Scan.
Execution time: 0.585s
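If you want to watch the planner change its mind on a query like this, Postgres has planner settings such as enable_seqscan that discourage a given scan type. A quick exploration sketch (session-level settings, meant only for experimenting, not for production tuning):

SET enable_seqscan = off;  -- make sequential scans look prohibitively expensive to the planner
EXPLAIN SELECT * FROM bank.user WHERE score > 600000;  -- the plan should now favor score_idx
SET enable_seqscan = on;   -- restore the default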
Experiment 7 (E7)
An update query on score with filter on score, with an index on score

explain update bank.user set score = 1000000 where score > 1000;

[Figure: Experiment 7 query plan]

Scan Type: Bitmap Index Scan, which means the number of records that satisfy the condition is estimated to be too large to benefit from an Index Scan but too small to warrant a full Seq Scan.
Execution time: 4.239s
Experiment 8 (E8)
A select query with equality filter on user_account_type, with an index on user_account_type

CREATE INDEX acct_type_idx ON bank.user (user_account_type);
-- Time: 0.846s
explain select * from bank.user where user_account_type = 'S';
[Figure: Experiment 8 query plan]

Scan Type: Bitmap Index Scan, which means the number of records that satisfy the condition is estimated to be too large to benefit from an Index Scan but too small to warrant a full Seq Scan.
Execution time: 0.660s
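-- Full script for reference: all the experiment commands with the observed execution times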
CREATE SCHEMA bank;
SET search_path TO bank,public;
CREATE TABLE bank.user (
user_id int,
user_name varchar(50),
score int,
user_account_type char(1)
);
COPY bank.user FROM '/data/user.csv' DELIMITER ',' CSV HEADER;
-- Time: 1.122s

-- E1
explain select * from bank.user where score = 900001;
-- Time: 0.070s

-- E2
explain select * from bank.user where score > 900000;
-- Time: 0.422s

-- E3
explain update bank.user set score = 1000000 where score > 1000;
-- Time: 2.580s (2 seconds)

-- E4
explain select * from bank.user where user_account_type = 'S';
-- Time: 0.747s

-- Adding Index
CREATE INDEX score_idx ON bank.user (score);
CREATE INDEX acct_type_idx ON bank.user (user_account_type);

-- E5
explain select * from bank.user where score = 900001;
-- Time: 0.017s

-- E6
explain select * from bank.user where score > 900000;
-- Time: 1.351s (a second)
explain select * from bank.user where score > 600000;
-- Time: 1.675s (a second)

-- E7
explain update bank.user set score = 1000000 where score > 1000;
-- Time: 8.627s (8 seconds)

-- E8
explain select * from bank.user where user_account_type = 'S';
-- Time: 0.883s (a moment)

-- Note: Execution time may vary slightly
Cardinality
Cardinality is the uniqueness of a data field. For example, if we have a field called gender with only M or F as acceptable values, we say that field has low cardinality. Basically, you can think of it as

high cardinality field has many unique values (e.g. score in our dataset)
low cardinality field has few unique values (e.g. user_account_type in our dataset)
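One quick way to gauge the cardinality of a column is to compare its distinct count to the total row count, for example:

-- Ratio closer to 1 means higher cardinality (more unique values)
SELECT count(DISTINCT score)::float / count(*) AS score_cardinality
     , count(DISTINCT user_account_type)::float / count(*) AS acct_type_cardinality
FROM bank.user;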
Conclusion
Hope the above experiments give you a good understanding of when and how to use indexes, how to use explain to check your query performance and what the different types of scans mean.

TL;DR
Pros of using an index:

A query with a filter on an indexed column is much faster than one without. Compare E1 vs E5.
Cons of using an index:

Performance drop for insert/update/delete operations, since the B-tree also has to be kept balanced. Compare E7 and E3: update time 8.627s with the index vs 2.447s without.
Depending on the database you are using, insert/update/delete operations can be blocked while the column is being indexed (see the note after this list).
An index will not always be used; it depends on the estimated result size of your query (use explain to check your query plan).
The performance gain from an index on a low cardinality column is smaller than on a high cardinality one. In our example, the reduction in latency from adding an index on the low cardinality field user_account_type was 20%, whereas the reduction from adding an index on the high cardinality field score was 65%.
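One practical note on the locking point above: in Postgres specifically you can build an index without blocking concurrent writes by using CREATE INDEX CONCURRENTLY, at the cost of a slower build (a sketch with an illustrative index name; this is Postgres-specific and cannot run inside a transaction block):

-- Allows inserts/updates/deletes on bank.user while the index is being built
CREATE INDEX CONCURRENTLY score_idx_v2 ON bank.user (score);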
Hope this gives you a good idea of how to effectively use indexes. If you have any questions or comments please reach out or leave a comment below.

