Giter VIP home page Giter VIP logo

dend-project-3-data-warehouse-aws's Introduction

Project 3: Song Play Analysis With S3 and Redshift

Project passed

Summary


Preamble

In this project we are going to use two Amazon Web Services, S3 (Data storage)
and Redshift (Data warehouse with columnar storage)

Data sources are provided by two public S3 buckets. One bucket contains
info about songs and artists, the second has info concerning actions done
by users (which song are listening, etc.. ). The objects contained in both buckets
are JSON files. The song bucket has all the files under the same directory but
the event ones don't, so we need a descriptor file (also a JSON) in order to extract
data from the folders by path. We used a descriptor file because we don't have a common prefix on folders

The Redshift service is where data will be ingested and transformed,
in fact though COPY command we will access to the JSON files inside
the buckets and copy their content on our staging tables


Schema definition

This is the schema of the database

How to read the schema:

  • Blank bullets are used to identify the fields that can be null
  • Black bullets are used to identify the fields that can not be null
  • If the field is underlined means that is a primary key

schema

To represent this context a Star schema has been used

The songplays table is the core of this schema, is it our fact table and
it contains foreign keys to four tables;

  • start_time REFERENCES time(start_time)
  • user_id REFERENCES time(start_time)
  • song_id REFERENCES songs(song_id)
  • artist_id REFERENCES artists(artist_id)

There are also two staging tables; One for song dataset and one for
and one for event dataset


ETL process

In this project most of ETL is done with SQL (Python used just as bridge), transformation and data normalization is done by Query, check out the sql_queries python module


How to run

Although the data-sources are provided by two S3 buckets the only thing you need for running the example is an AWS Redshift Cluster up and running

And of course Python

Notes:

  • In this example a Redshift dc2.large cluster with 4 nodes has been created, with a cost of USD 0.25/h (on-demand option) per cluster
  • In this example we will use IAM role authorization mechanism, the only policy attached to this IAM will be am AmazonS3ReadOnlyAccess

After opening terminal session, set your filesystem on project root folder
and insert these commands in order to run the demo:

This will create our tables, this must be runned first
python create_tables.py

And this will execute our ETL process
python etl.py


Project structure

This is the project structure, if the bullet contains /
means that the resource is a folder:

  • /img - Simply a folder with images that are used in this md
  • create_tables.py - This script will drop old tables (if exist) ad re-create new tables
  • etl.py - This script executes the queries that extract JSON data from the S3 bucket and ingest them to Redshift
  • sql_queries.py - This file contains variables with SQL statement in String formats, partitioned by CREATE, DROP, COPY and INSERT statements
  • dhw.cfg - Configuration file used that contains info about Redshift, IAM and S3

dend-project-3-data-warehouse-aws's People

Contributors

federicoserini avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.