This project performs a comprehensive data analysis on the Chinook Database using SQL, Python, and data visualization techniques. The Chinook Database holds information about a music store. For a deeper understanding, it is divided into 11 tables namely: 'Employee', 'Customer', 'Invoice', 'InvoiceLine', 'Artist', 'Album', 'Track', 'Genre', 'MediaType', 'Playlist', and 'PlaylistTrack'.
The objective of this project is to answer meaningful business-oriented questions about the company's operations. This includes information on sales, customers, employee performance, and more.
This project is implemented using SQL for database querying, Python for data processing and visualization, and Matplotlib for generating the graphs. SQLite was used for the database.
-
Inspect all tables in the database: SQL queries are written to fetch the contents of each table, display the first few rows, and understand the structure of the data.
-
Check for missing data: A Python function iterates over each table and column, checking for any missing values.
-
Answer business questions using SQL: 20 business-oriented questions are addressed, allowing for exploration of a variety of SQL queries, enhancing understanding and showcasing SQL skills.
-
Visualize the results: After querying the database, Python and Matplotlib are used to create visualizations that will help better understand the data and the findings.
This project answers the following questions:
- What is the total revenue per country?
- Who are the top 5 customers by revenue?
- What are the top 10 best selling tracks?
- What percentage of total revenue does each country contribute?
- What are the top 5 albums by sales?
- What percentage of total sales does each genre represent?
- Who are the top 3 employees by sales?
- What is the sales data per year?
- What is the sales data per month?
- How many customers does each employee support?
- What percentage of total sales does each media type represent?
- How many playlists include each track?
- What is the total time length of all tracks sold per genre?
- What percentage of total revenue does each media type contribute?
- What are the top 5 countries by invoice count?
- What is the distribution of sales among the different billing cities?
- How many invoices were there each year?
- How many tracks are there in each playlist?
- What is the average invoice total per country?
- What is the average number of tracks per album?
- 'eda_sql_python.ipynb': Jupyter notebook containing all the SQL queries, data analysis steps, and visualizations.
- 'Chinook_Sqlite.sqlite': SQLite database file.
- 'README.md': This file, providing an overview of the project.
To run this project on your local machine, follow these steps:
- Clone the Repository: First, clone this repository to your local machine using the following command in your terminal or command prompt:
git clone https://github.com/rawbeen248/eda-sql-python
- Navigate to the Cloned Directory: Use the following command to go to the cloned directory:
cd eda-sql-python
- Install Required Python Libraries: This project requires Python and some Python libraries. You can install the necessary libraries using the following command:
pip install pandas matplotlib sqlite3
- Open the Jupyter Notebook and run the cells: Once the Jupyter notebook is open, you can run each cell by selecting the cell and clicking on 'Run' or using the shortcut Shift + Enter. The SQL queries will fetch data from the database and the Python codes will process and visualize the data.
Please note that the SQLite database file Chinook_Sqlite.sqlite should be in the same directory as the Jupyter notebook for the code to run properly.