Giter VIP home page Giter VIP logo

dsc-obtaining-your-data's Introduction

Obtaining Your Data

Introduction

In this lesson, you'll synthesize many of your data loading skills you learned to date in order to merge multiple datasets from various sources as part of an ETL process.

Objectives

You will be able to:

  • Describe the ETL process and the steps it consists of
  • Perform an ETL process with multiple tables and create a single dataset

Loading SQL DB to DataFrames

Recall, that ETL refers to the Extract, Transform, Load procedure for obtaining data that you can work with. Here, you'll reinforce your ETL skills by creating a single dataset from multiple data tables.

Start extracting data with the code cells below!

import sqlite3
import pandas as pd

# Create a connection
con = sqlite3.connect('data.sqlite')
# Create a cursor
cur = con.cursor()
# Select some data
cur.execute("""SELECT * FROM orders JOIN orderdetails USING(orderNumber);""")
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
print(df.shape)
df.head()
(2996, 11)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
orderNumber orderDate requiredDate shippedDate status comments customerNumber productCode quantityOrdered priceEach orderLineNumber
0 10100 2003-01-06 2003-01-13 2003-01-10 Shipped 363 S18_1749 30 136.00 3
1 10100 2003-01-06 2003-01-13 2003-01-10 Shipped 363 S18_2248 50 55.09 2
2 10100 2003-01-06 2003-01-13 2003-01-10 Shipped 363 S18_4409 22 75.46 4
3 10100 2003-01-06 2003-01-13 2003-01-10 Shipped 363 S24_3969 49 35.29 1
4 10101 2003-01-09 2003-01-18 2003-01-11 Shipped Check on availability. 128 S18_2325 25 108.06 4
# Create a connection
con = sqlite3.connect('data.sqlite')
# Create a cursor
cur = con.cursor()
# Select some data
cur.execute("""SELECT * FROM products;""")
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
print(df.shape)
df.head()
(110, 9)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
productCode productName productLine productScale productVendor productDescription quantityInStock buyPrice MSRP
0 S10_1678 1969 Harley Davidson Ultimate Chopper Motorcycles 1:10 Min Lin Diecast This replica features working kickstand, front... 7933 48.81 95.70
1 S10_1949 1952 Alpine Renault 1300 Classic Cars 1:10 Classic Metal Creations Turnable front wheels; steering function; deta... 7305 98.58 214.30
2 S10_2016 1996 Moto Guzzi 1100i Motorcycles 1:10 Highway 66 Mini Classics Official Moto Guzzi logos and insignias, saddl... 6625 68.99 118.94
3 S10_4698 2003 Harley-Davidson Eagle Drag Bike Motorcycles 1:10 Red Start Diecast Model features, official Harley Davidson logos... 5582 91.02 193.66
4 S10_4757 1972 Alfa Romeo GTA Classic Cars 1:10 Motor City Art Classics Features include: Turnable front wheels; steer... 3252 85.68 136.00

Merging Data

Recall, that you can also join data from multiple tables in SQL.

# Create a connection
con = sqlite3.connect('data.sqlite')
# Create a cursor
cur = con.cursor()
# Select some data
cur.execute("""SELECT * FROM products
                        JOIN orderdetails
                        USING (productCode);""")
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
print(df.shape)
df.head()
(2996, 13)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
productCode productName productLine productScale productVendor productDescription quantityInStock buyPrice MSRP orderNumber quantityOrdered priceEach orderLineNumber
0 S10_1678 1969 Harley Davidson Ultimate Chopper Motorcycles 1:10 Min Lin Diecast This replica features working kickstand, front... 7933 48.81 95.70 10107 30 81.35 2
1 S10_1678 1969 Harley Davidson Ultimate Chopper Motorcycles 1:10 Min Lin Diecast This replica features working kickstand, front... 7933 48.81 95.70 10121 34 86.13 5
2 S10_1678 1969 Harley Davidson Ultimate Chopper Motorcycles 1:10 Min Lin Diecast This replica features working kickstand, front... 7933 48.81 95.70 10134 41 90.92 2
3 S10_1678 1969 Harley Davidson Ultimate Chopper Motorcycles 1:10 Min Lin Diecast This replica features working kickstand, front... 7933 48.81 95.70 10145 45 76.56 6
4 S10_1678 1969 Harley Davidson Ultimate Chopper Motorcycles 1:10 Min Lin Diecast This replica features working kickstand, front... 7933 48.81 95.70 10159 49 81.35 14

You can also merge data from a separate csv file. For example, say you take a separate data source regarding daily sales data for the various branches. You might first generate a view from our database:

# Create a connection
con = sqlite3.connect('data.sqlite')
# Create a cursor
cur = con.cursor()
# Select some data
cur.execute("""SELECT * FROM customers
                        JOIN orders
                        USING(customerNumber);""")
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
print(df.shape)
df.head()
(326, 19)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
customerNumber customerName contactLastName contactFirstName phone addressLine1 addressLine2 city state postalCode country salesRepEmployeeNumber creditLimit orderNumber orderDate requiredDate shippedDate status comments
0 103 Atelier graphique Schmitt Carine 40.32.2555 54, rue Royale Nantes 44000 France 1370 21000.00 10123 2003-05-20 2003-05-29 2003-05-22 Shipped
1 103 Atelier graphique Schmitt Carine 40.32.2555 54, rue Royale Nantes 44000 France 1370 21000.00 10298 2004-09-27 2004-10-05 2004-10-01 Shipped
2 103 Atelier graphique Schmitt Carine 40.32.2555 54, rue Royale Nantes 44000 France 1370 21000.00 10345 2004-11-25 2004-12-01 2004-11-26 Shipped
3 112 Signal Gift Stores King Jean 7025551838 8489 Strong St. Las Vegas NV 83030 USA 1166 71800.00 10124 2003-05-21 2003-05-29 2003-05-25 Shipped Customer very concerned about the exact color ...
4 112 Signal Gift Stores King Jean 7025551838 8489 Strong St. Las Vegas NV 83030 USA 1166 71800.00 10278 2004-08-06 2004-08-16 2004-08-09 Shipped

And then load the seperate datefile:

daily_sums = pd.read_csv('Daily_Sales_Summaries.csv')
daily_sums.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
orderDate min max sum mean std
0 2003-01-06 1660.12 4080.00 10223.83 2555.957500 1132.572429
1 2003-01-09 1463.85 4343.56 10549.01 2637.252500 1244.866467
2 2003-01-10 1768.33 3726.45 5494.78 2747.390000 1384.599930
3 2003-01-29 1283.48 5571.80 50218.95 3138.684375 1168.280303
4 2003-01-31 1338.04 4566.99 40206.20 3092.784615 1148.570425
merged = pd.merge(df, daily_sums)

Checking Merged Data

It's always good practice to check assumptions and preview transformed data views throughout your process. Let's take a look:

merged.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
customerNumber customerName contactLastName contactFirstName phone addressLine1 addressLine2 city state postalCode ... orderDate requiredDate shippedDate status comments min max sum mean std
0 103 Atelier graphique Schmitt Carine 40.32.2555 54, rue Royale Nantes 44000 ... 2003-05-20 2003-05-29 2003-05-22 Shipped 2163.50 5282.64 14571.44 3642.860000 1322.891537
1 103 Atelier graphique Schmitt Carine 40.32.2555 54, rue Royale Nantes 44000 ... 2004-09-27 2004-10-05 2004-10-01 Shipped 1938.24 4128.54 6066.78 3033.390000 1548.775983
2 103 Atelier graphique Schmitt Carine 40.32.2555 54, rue Royale Nantes 44000 ... 2004-11-25 2004-12-01 2004-11-26 Shipped 557.60 7573.50 20564.45 2570.556250 2178.832190
3 350 Marseille Mini Autos Lebihan Laurence 91.24.4555 12, rue des Bouchers Marseille 13008 ... 2004-11-25 2004-12-02 2004-11-29 Shipped 557.60 7573.50 20564.45 2570.556250 2178.832190
4 112 Signal Gift Stores King Jean 7025551838 8489 Strong St. Las Vegas NV 83030 ... 2003-05-21 2003-05-29 2003-05-25 Shipped Customer very concerned about the exact color ... 798.38 4704.92 40207.06 2680.470667 1255.052262

5 rows ร— 24 columns

Pandas' merge() function conveniently uses common column names between the DataFrames as keys. You can always specify what columns to join on by using the on keyword as in pd.merge(df1, df2, on=[col1, col2]). Unfortunately, columns that are not identically named beforehand will not work with this method. Additionally, it is imperative to check the formatting of the join keys between the tables. A number formatted as a string can often ruin joins, and separate formatting conventions such as 'U.S.' versus 'USA' are also important preprocessing considerations before merging data files from various sources. In this case, everything worked smoothly, but it's good to keep in mind what problems may occur.

Saving Transformed Data to File

Finally, we can save our transformed dataset.

merged.to_csv('Merged_Dataset.csv', index=False)

Summary

Well done! In this lesson you went through an ETL process by performing merges. You also saw the potential pitfalls in merging datasets from different sources. In the next lab, you'll get some practice doing this as an initial step to a regression task.

dsc-obtaining-your-data's People

Contributors

loredirick avatar mas16 avatar mathymitchell avatar sumedh10 avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.