Giter VIP home page Giter VIP logo

sparkify-etl's Introduction

Schema for Song Play Analysis

General overview

This project contains code needed to create a database for Sparkify, this database will be used for analytical purposes.

Purpose

Sparkify, a startup in the music streaming industry, wants to analyze data about songs and user activity. To do so, right now they have JSON files to do the analysis, this JSON files are not of easy access to analyze data. For this reason, they need a database that can fill their analytical goals. They plan to use this database to understand the songs that their users are listening to.

Database design and ETL pipeline

The database is constructed on PostgreSQL, and it follows a star schema for analytic focus. The fact table is the songplays table. The dimension tables are users table, songs table, artists table and time. In order to see the details about this step see create_tables.py and sql_queries.py

The ETL pipeline is the following:

  • We extract data from JSON format into Python
  • In Python we take the data into the right format.
  • Finally, we load it into our PostgreSQL database called sparkifydb. To see the details about this step see etl.py.

In order to run the whole ETL pipeline, first run create_tables.py and then run etl.py.

Example

We provide example queries and results for song play analysis. If you want to test this queries yourself you can use _test.ipynb .

  • An important query to keep track of is the number of free and paid users, and we can segment this using gender.
SELECT s.level, u.gender, count(*) quantity 
FROM songplays s 
JOIN users u 
ON u.user_id = s.user_id
GROUP BY s.level, u.gender

With the data that we have we can see that there are more females in the paid group than any other group.

  • We can also check data quality with SQL queries:
SELECT 
    SUM(CASE WHEN year = 0 THEN 1 ELSE 0 END) year_quantity_Null, 
    ROUND(AVG(CASE WHEN year = 0 THEN 1 ELSE 0 END)*100,2) year_percentage_null 
FROM songs

From this we can see that 43 songs do not have year provided. And this represents roughly 60% of the total songs. This must be fixed since some analysis can be missinterpred.

Next steps

As any project, there are things that can be improved or added. Here I present a list of things that I consider are important for a second stage of the project.

  • A data cleaning process to make sure the whole database has a high quality standard of data. As an example, currently, I presented a simple example of quality of data, but this example can be extended to check quality as a whole.
  • By now we have only extracted information from json format. For this reason we still need to extract information via Json to then upload into our PostgreSQL database. This is inefficient since the json step can be ommited.

sparkify-etl's People

Contributors

jestefano avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.