Giter VIP home page Giter VIP logo

udacity-dend-project-1's Introduction

(Udacity: Data Engineering Nano Degree) | [email protected] | 2019-04-16 This project is a part of Udacity's Data Engineer Nano Degree.

PROJECT-1: Data Modelling with Postgres

Quick start

After installing python3 + postgresql libraries and dependencies, run from command line:

  • python3 create_tables.sql (to create the DB to Postgresql)
  • python3 etl.py (to process all the input data to the DB)

Overview

This Project-1 handles data of a music streaming startup, Sparkify. Data set is a set of files in JSON format and contains two parts:

  • ./data/song_data: static data about artists and songs
  • ./data/log_data: event data of service usage e.g. who listened what song, when, where, and with which client

Below, some figures about the data set (results after running the etl.py):

  • ./data/song_data: 71 files
  • ./data/log_data: 30 files
  • songplays: 6820
  • (unique) user: 97
  • songs: 71
  • artists: 69
  • songplays with an artist included in artist table: 1 (song_id = SOZCTXZ12AB0182364, artist_id = AR5KOSW1187FB35FF4)

Project builds an ETL pipeline (Extract, Transform, Load) to create the DB and tables, fetch data from JSON files, process the data, and insert the the data to DB. As technologies, Project-1 uses python, SQL, Postgresql DB.


About Database

Sparkify analytics database (called here sparkifydb) schema has a star design. Start design means that it has one Dimension Table having business data, and supporting Fact Tables. Dimension Table answers one of the key questions: what songs users are listening to. DB schema is the following:

SparkifyDB schema as ER Diagram

SparkifyDB schema as ER Diagram.

Fact Table

  • songplays: song play data together with user, artist, and song info (songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)

Dimension Tables

  • users: user info (columns: user_id, first_name, last_name, gender, level)
  • songs: song info (columns: song_id, title, artist_id, year, duration)
  • artists: artist info (columns: artist_id, name, location, latitude, longitude)
  • time: detailed time info about song plays (columns: start_time, hour, day, week, month, year, weekday)

HOWTO use

Project has two scripts:

  • create_tables.py: This script drops existing tables and creates new ones.
  • etl.py: This script uses data in ./data/song_data and ./data/log_data, processes it, and inserts the processed data into DB.

Prerequisites

Python3 is recommended as the environment. The most convenient way to install python is to use Anaconda (https://www.anaconda.com/distribution/) either via GUI or command line. Also, the following libraries are needed for the python environment to make Jupyter Notebook and Postgresql to work:

  • postgresql (+ dependencies) to enable sripts and Jupyter to connect to Postgresql DB.
  • jupyter (+ dependencies) to enable Jupyter Notebook.
  • ipython-sql (https://anaconda.org/conda-forge/ipython-sql) to make Jupyter Notebook and SQL queries to Postgresql work together. NOTE: you may need to install this library from command line.

Run create_tables.py

Type to command line:

python3 create_tables.py

Output: Script writes "Tables dropped successfully" and "Tables created successfully" if all tables were dropped and created without errors.

Run etl.py

Type to command line:

python3 etl.py

Output: Script crawls through all the data directories, tells how many files it has to process, and writes to console the progress of the script as it processes them through. After successfully processing through a directory, script summarised the results:

  • "All xx files processed OK in aaa/abc123" (e.g. data/song_data) and
  • "All xx files processed OK in aaa/def456" (e.g. data/log_data).

Data cleaning process

etl.pyworks the following way to clean-up and process the source data:

  • Scripts crawls through all source data JSON files.
  • From each file in ./data/song_data directory, script selects songs table values to a song_data DataFrame and inserts them to songs table, and artist table values to artist_data DataFrame and inserts them to artists table.
  • From each file in ./data/logdata directory, script filters items with NextSong value in page parameter. It then processes _time_data from ts parameter value to be inserted in time table and user_data to be inserted in users table.
  • ./data/songdata directory also contains data which is processed into _songplay_data. User_id and artist_id fields are fetched from users and artists table using song_select query. Other parameters are selected from log_data. Formed songplay_data data is finally inserted into songplays table.

Other

Project-1 also contains the following files:

  • etl.ipynb: Jupyter Notebook for interactively develop and run python code to be used in etl.py.
  • test.ipynb: Jupyter Notebook for interactively run test SQL queries against used DB.

Summary

Project-1 provides customer startup Sparkify tools to analyse their service data and help them answer their key business questions like "What songs users are listening to".

udacity-dend-project-1's People

Contributors

jukkan01 avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.