Python_Project_1

Introduction

  • In this project, a 'Tracks Dataset' obtained from the streaming platform Spotify is presented.

  • The link to the dataset is as given: link

  • My aim is to perform exploratory data analysis on the dataset using Python in a Jupyter Notebook and to draw meaningful insights about the performance of different tracks on Spotify.

Dataset Overview

  • I first imported the libraries needed for the analysis.

    • Code: import numpy as np, import pandas as pd, import matplotlib.pyplot as plt and import seaborn as sns, each run as a separate statement in my Jupyter notebook.
  • I then loaded the tracks dataset into a DataFrame named 'df_tracks' and confirmed that it loaded correctly by previewing the first rows.

    • Code 1: df_tracks=pd.read_csv( ) and Code 2: df_tracks.head( )
  • Checked for null values across all columns. Only the 'name' column had null values, 71 in total.

    • Code: pd.isnull(df_tracks).sum( )
  • Obtained information about the dataset. It has 20 columns and 586672 records. In addition, the output lists each column's name and data type.

    • Code: df_tracks.info( )

    • Output:

      #   Column            Non-Null Count   Dtype
      0   id                586672 non-null  object
      1   name              586601 non-null  object
      2   popularity        586672 non-null  int64
      3   duration_ms       586672 non-null  int64
      4   explicit          586672 non-null  int64
      5   artists           586672 non-null  object
      6   id_artists        586672 non-null  object
      7   release_date      586672 non-null  object
      8   danceability      586672 non-null  float64
      9   energy            586672 non-null  float64
      10  key               586672 non-null  int64
      11  loudness          586672 non-null  float64
      12  mode              586672 non-null  int64
      13  speechiness       586672 non-null  float64
      14  acousticness      586672 non-null  float64
      15  instrumentalness  586672 non-null  float64
      16  liveness          586672 non-null  float64
      17  valence           586672 non-null  float64
      18  tempo             586672 non-null  float64
      19  time_signature    586672 non-null  int64

      dtypes: float64(9), int64(6), object(5)
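The loading and null-check steps above can be sketched on a tiny in-memory CSV. The three rows below are made up for illustration and are not taken from the Spotify file; only the column names `id`, `name` and `popularity` come from the dataset, and `df_demo` is a hypothetical stand-in for `df_tracks`.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the tracks file (not real data).
csv_text = """id,name,popularity
a1,Song A,10
b2,,55
c3,Song C,91
"""

df_demo = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column, as done for df_tracks.
null_counts = df_demo.isnull().sum()
print(null_counts)

# Rows and columns, the same figures that info() reports.
n_rows, n_cols = df_demo.shape
print(n_rows, n_cols)
```

An empty field in the CSV is read as a missing value, which is why the second row counts as a null in the `name` column.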

Exploratory Data Analysis

  • To start, I obtained the descriptive statistics for each numeric column, i.e. count, mean, std, min, 25%, 50%, 75% and max.

    • Code: df_tracks.describe( ).transpose()
  • Obtained the 10 least popular songs, in order to study their characteristics and look for possible causes of their underperformance on Spotify.

    • Code: sorted_df=df_tracks.sort_values('popularity',ascending=True).head(10)
  • Obtained the top 10 songs among those with a popularity above 90, in order to study their characteristics and look for possible causes of their strong performance on Spotify.

    • Code 1: most_popular=df_tracks.query('popularity > 90',inplace=False).sort_values('popularity',ascending=False)

    • Code 2: most_popular[:10]
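As a minimal sketch of the filtering and sorting steps, the example below runs the same kind of query on a made-up frame. The names and scores are hypothetical; `nsmallest` is shown only as an optional shortcut that is roughly equivalent to sorting in ascending order and taking the first rows.

```python
import pandas as pd

# Hypothetical popularity scores, for illustration only.
df_demo = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D", "E"],
        "popularity": [12, 95, 0, 91, 88],
    }
)

# The whole condition goes inside the query string.
most_popular = df_demo.query("popularity > 90").sort_values(
    "popularity", ascending=False
)

# Shortcut for sorting ascending and keeping the first two rows.
least_popular = df_demo.nsmallest(2, "popularity")

print(most_popular)
print(least_popular)
```

Writing the comparison outside the quotes would compare a Python string with an integer before pandas ever sees it, so the condition has to be part of the string.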

  • Obtained the artist at index position 18 of the dataset. Running the code showed that this entry belongs to the artist Victor Boucher. Since iloc counts from 0, this is the 19th row.

    • Code: df_tracks["artists"].iloc[18]
  • Created a new column named 'duration', which holds the values of the 'duration_ms' column converted from milliseconds to seconds and rounded to the nearest second. In addition, I dropped the 'duration_ms' column.

    • Code 1: df_tracks["duration"]=df_tracks["duration_ms"].apply(lambda x:round(x/1000))

    • Code 2: df_tracks.drop("duration_ms",inplace=True,axis=1)

    • To confirm that the column was created successfully, I ran the code: df_tracks.duration.head()
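The millisecond-to-second conversion can also be written without `apply`, by operating on the whole column at once. This is a sketch of an alternative, not the approach used in the notebook; the durations below are made up and `df_demo` is a hypothetical stand-in for `df_tracks`.

```python
import pandas as pd

# Hypothetical durations in milliseconds, for illustration only.
df_demo = pd.DataFrame({"duration_ms": [126903, 98200, 181640]})

# Vectorised conversion: divide the whole column, round, cast to int.
df_demo["duration"] = (df_demo["duration_ms"] / 1000).round().astype(int)

# Drop the original column by reassigning instead of using inplace=True.
df_demo = df_demo.drop(columns="duration_ms")

print(df_demo)
```

On a frame with several hundred thousand rows, column-wise arithmetic of this kind usually runs faster than calling a Python function once per row.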

  • Created a sample dataset named 'sample_df' from the existing dataset. The sample is intended to make the EDA faster and more efficient. It contains 2346 records.

    • Code 1: sample_df=df_tracks.sample(int(0.004*len(df_tracks)))

    • Code 2: print(len(sample_df))
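Because `sample` draws rows at random, each run of the notebook produces a different subset unless a seed is fixed. The sketch below, on a made-up 1,000-row frame, shows how passing `random_state` makes the draw repeatable; the seed value 42 is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical 1,000-row frame standing in for df_tracks.
df_demo = pd.DataFrame({"popularity": range(1000)})

# Same 0.4% share as in the project, truncated to a whole number of rows.
n_rows = int(0.004 * len(df_demo))

# A fixed random_state returns the same rows on every run.
sample_a = df_demo.sample(n_rows, random_state=42)
sample_b = df_demo.sample(n_rows, random_state=42)

print(len(sample_a))
print(sample_a.index.equals(sample_b.index))
```

A fixed seed keeps the regression plots built on the sample comparable between runs.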

Data Visualization

1) Heatmap to display the correlation between the dataset variables.

  • The heatmap shows how strongly each pair of variables is correlated. A coefficient x indicates positive correlation when 0 < x ≤ 1 and negative correlation when -1 ≤ x < 0.

  • Code:

    corr_df=df_tracks.drop(["key","mode","explicit"],axis=1).select_dtypes("number").corr(method='pearson')

    plt.figure(figsize=(14,6))

    heatmap=sns.heatmap(corr_df,annot=True,fmt=".1g",vmin=-1,vmax=1,center=0,cmap="inferno",linewidths=1,linecolor="black")

    heatmap.set_title("Correlation Heatmap between Variables")

    heatmap.set_xticklabels(heatmap.get_xticklabels(),rotation=90)

image
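To make the numbers behind the heatmap concrete, the sketch below computes a Pearson correlation matrix on a made-up frame. The values are hypothetical and chosen so that one pair is perfectly positively correlated and another perfectly negatively correlated; the text column is filtered out first, since a correlation is only defined for numeric data.

```python
import pandas as pd

# Hypothetical values, for illustration only.
df_demo = pd.DataFrame(
    {
        "artists": ["x", "y", "z", "w"],  # text column, not numeric
        "energy": [0.1, 0.4, 0.6, 0.9],
        "loudness": [-20.0, -14.0, -10.0, -4.0],  # rises with energy
        "acousticness": [0.9, 0.6, 0.4, 0.1],  # falls as energy rises
    }
)

# Keep numeric columns only, then compute pairwise Pearson coefficients.
corr_demo = df_demo.select_dtypes("number").corr(method="pearson")
print(corr_demo.round(2))
```

Each cell of the heatmap is one entry of a matrix like `corr_demo`, with the diagonal always equal to 1 because every variable is perfectly correlated with itself.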

2) Regression plot to display the correlation between the variables 'Loudness' and 'Energy'.

  • The visualization shows that the two variables are strongly positively correlated.

  • Code:

    plt.figure(figsize=(10,6))

    sns.regplot(data=sample_df,y="loudness",x="energy",color="c").set(title="Loudness vs Energy Correlation")

image
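A regression plot draws a scatter of the points together with a fitted straight line. As a rough sketch of what that line represents, the example below fits a degree-1 least-squares line with NumPy and reports the Pearson coefficient. The energy and loudness values are made up for illustration and lie exactly on a line, which real tracks would not.

```python
import numpy as np

# Hypothetical (energy, loudness) pairs, for illustration only.
energy = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
loudness = np.array([-20.0, -16.0, -12.0, -8.0, -4.0])

# Degree-1 least-squares fit: loudness ~ slope * energy + intercept.
slope, intercept = np.polyfit(energy, loudness, 1)

# Pearson r summarises how tightly the points follow that line.
r = np.corrcoef(energy, loudness)[0, 1]

print(slope, intercept, r)
```

A positive slope together with an r close to 1 is the numerical counterpart of the "strongly positively correlated" reading of the plot.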

3) Regression plot to display the correlation between the variables 'Popularity' and 'Acousticness'.

  • From the visualization, we can observe that the two variables are negatively correlated.

  • Code:

    plt.figure(figsize=(10,6))

    sns.regplot(data=sample_df,y="popularity",x="acousticness",color="b").set(title="Popularity vs Acousticness Correlation")

image

4) Histogram to display the number of songs available per year.

  • From the visualization, we can observe an increasing trend in the number of songs available from 1980 towards 2000. However, around the year 2000 there was a drastic drop in numbers, followed by steady growth towards 2010.

  • Code:

    df_tracks["dates"]=df_tracks["release_date"]

    df_tracks.dates=pd.to_datetime(df_tracks.dates)

    years=df_tracks.dates.dt.year

    sns.displot(years,discrete=True,aspect=2,height=5,kind="hist").set(title="Number of songs per year")

image
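The counts behind the histogram can be checked directly by tallying tracks per year. The sketch below assumes that `release_date` is stored as text beginning with a four-digit year (the info() output lists it with an object dtype) and that some entries may carry only a year or a year and month; the dates themselves are made up for illustration.

```python
import pandas as pd

# Hypothetical release dates in mixed precision, for illustration only.
df_demo = pd.DataFrame(
    {
        "release_date": [
            "1999-05-17",
            "1999",
            "2005-11",
            "2005-01-02",
            "2005-03-09",
        ]
    }
)

# The year is the first four characters whatever the precision.
years_demo = df_demo["release_date"].str[:4].astype(int)

# Number of songs per year, ordered by year.
songs_per_year = years_demo.value_counts().sort_index()
print(songs_per_year)
```

Slicing the string avoids having to parse every entry as a full date, which can matter when the date strings do not all share one format.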

5) Line graph to show the average duration of songs over the years.

  • From the visualization, we can observe that song duration fluctuated up and down throughout the period. However, over the last 20 years (2000-2020) there has been a steady decline in song duration.

  • Code:

    total_dr=df_tracks.duration

    sns.set_style(style="whitegrid")

    fig_dims=(10,5)

    fig,ax=plt.subplots(figsize=fig_dims)

    sns.lineplot(x=years,y=total_dr,ax=ax).set(title="Year vs Duration")

    plt.xticks(rotation=90)

image
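When several rows share the same x value, seaborn's line plot aggregates them with the mean by default, so the line traces the average duration per year. A minimal sketch of that aggregation, on made-up years and durations, is:

```python
import pandas as pd

# Hypothetical years and durations in seconds, for illustration only.
df_demo = pd.DataFrame(
    {
        "year": [2000, 2000, 2010, 2010, 2020],
        "duration": [240, 260, 220, 200, 180],
    }
)

# Mean duration per year: the value the line traces when x has repeats.
mean_duration = df_demo.groupby("year")["duration"].mean()
print(mean_duration)
```

Computing the table first also makes it easy to read off exact averages for individual years, which the plot only shows approximately.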

Contributors

lauramutheu
