In this project we will implement an end-to-end pipeline: we will scrape book data from the Amazon website, clean it, load it into a database, and query it with SQL. The queries produce tables, which we will then load into Power BI to visualize the data. The outline of the project is as follows:
- Scraping the data from the Amazon website using Python
- Cleaning the data
- Performing an EDA
- Loading the data into SQL and querying it to form tables
- Loading the queried tables into Power BI and building visualizations
I have explained the process of scraping the data from the Amazon website with Selenium step by step, from reading the HTML tags and extracting the data to writing a CSV file, in the article below. You can give it a like and a clap.
Link to the Web Scraping Article on Medium
This README therefore covers the data cleaning and EDA part. The visualizations are covered further in the SQL and Power BI parts.
To clean the data, we first looked at the null values (isnull) and found that the following columns contain nulls:
- total reviews
- type
- price
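The null check above can be sketched with pandas as follows. This is only a minimal illustration: the sample rows, the rating text format, and the column names (`total_reviews`, `type`, `price`, `rating`) are assumptions standing in for the scraped CSV, not the actual dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the scraped CSV (values are made up)
df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "rating": ["4.5 out of 5 stars", "4.0 out of 5 stars",
               "3.8 out of 5 stars", "4.9 out of 5 stars"],
    "total_reviews": ["120", None, "45", "300"],
    "type": ["Paperback", "Hardcover", None, "Kindle Edition"],
    "price": ["250", "400", None, "0"],
})

# Count the missing values in every column
null_counts = df.isnull().sum()
print(null_counts)

# Keep only the names of the columns that contain at least one null
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```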
The data types are also wrong: we have to convert the price to an integer, and for the rating we have to strip the string information so that only the float value is left.
We will also fill the null values in the price and total reviews columns with the median, which is usually preferred over the mean because it is less affected by outliers in skewed data.
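The type conversion and median fill described above might look roughly like this. It is a sketch under assumptions: the rating is assumed to be scraped as text such as "4.5 out of 5 stars", and the sample values and column names are hypothetical.

```python
import pandas as pd

# Hypothetical raw values, as they might come out of the scraper
df = pd.DataFrame({
    "rating": ["4.5 out of 5 stars", "4.0 out of 5 stars",
               "3.8 out of 5 stars", "4.9 out of 5 stars"],
    "total_reviews": ["120", None, "45", "300"],
    "price": ["250", "400", None, "0"],
})

# Keep only the leading number of the rating text and convert it to float
df["rating"] = (
    df["rating"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
)

# Convert the text columns to numbers, then fill the gaps with the median
for col in ["price", "total_reviews"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")  # bad values become NaN
    df[col] = df[col].fillna(df[col].median())

print(df.dtypes)
print(df)
```

After the fill, the columns are floats. If whole numbers are needed, they can be rounded and cast with `astype(int)` once no nulls remain.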
We will analyze the individual columns, identify trends, and look for relationships between columns by plotting a correlation matrix.
In the following distplot we look at the distribution of the total reviews column, and we can see that the data in this column is left skewed.
In the following plot we look at the distribution of the price. The data is not skewed in either direction, although most of our books range from 0 to 400 shillings.
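Besides reading the plots, the direction of the skew can be checked numerically. The sketch below uses the pandas `skew()` method on made-up values (not the project data), so its numbers do not reproduce the results above. A positive value means a long tail to the right, a negative value a long tail to the left, and a value near zero means roughly symmetric data.

```python
import pandas as pd

# Hypothetical values, chosen only to show how the sign of skew() behaves
df = pd.DataFrame({
    "total_reviews": [5, 8, 12, 15, 20, 25, 540],      # one very large value
    "price": [100, 150, 200, 250, 300, 350, 400],      # evenly spread
})

reviews_skew = df["total_reviews"].skew()
price_skew = df["price"].skew()

print("total_reviews skew:", reviews_skew)  # positive: tail on the right
print("price skew:", price_skew)            # about zero: symmetric
```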
We can see that total reviews is correlated with the price column. This is what we would expect: a book that has been reviewed by many people has likely been read by many people, so its price will likely be higher. The other columns show no correlation, and a higher rating does not directly mean a higher price.
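A correlation matrix like the one discussed here can be computed with the pandas `corr()` method. The values below are hypothetical and only show the mechanics, so the coefficients they produce say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical cleaned values (made up for illustration)
df = pd.DataFrame({
    "rating": [4.5, 4.0, 3.8, 4.9, 4.2],
    "total_reviews": [120, 30, 45, 300, 80],
    "price": [250, 100, 150, 400, 200],
})

# Pearson correlation between every pair of numeric columns
corr = df[["rating", "total_reviews", "price"]].corr()
print(corr)

# A single pair can be read straight from the matrix
print(corr.loc["total_reviews", "price"])
```

The resulting matrix can be passed to a heatmap function for plotting if a visual version is preferred.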
From the cleaned data we will run various SQL queries and visualize the resulting tables in Power BI.
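One possible way to get the cleaned data into a SQL database is sketched below. It assumes an in-memory SQLite database and a table called `books`; both are illustrative choices, and the rows are hypothetical. The project itself may use a different database.

```python
import sqlite3
import pandas as pd

# Hypothetical cleaned rows (made up for illustration)
cleaned = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "type": ["Paperback", "Hardcover", "Kindle Edition"],
    "rating": [4.5, 4.0, 4.9],
    "total_reviews": [120, 30, 300],
    "price": [250, 400, 0],
})

# Assumption: SQLite in memory; a file path would persist the data instead
conn = sqlite3.connect(":memory:")
cleaned.to_sql("books", conn, index=False, if_exists="replace")

# Confirm the load and read a query result back into a DataFrame
row_count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
preview = pd.read_sql_query(
    "SELECT type, price FROM books ORDER BY price", conn
)
print(row_count)
print(preview)
```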
These are the queries we have performed:
- Selecting the min and max values from the rating column
- Selecting the min and max values from the price column
- Selecting the min and max values from the total reviews column
- Looking at the distribution of the book type
- Looking at the distribution of book types grouped by price
- Looking at the distribution of book types grouped by reviews, to see which type of book had the most reviews
- Looking at the distribution of book types based on ratings, to see which type of book cover had the most ratings
- Looking at the distribution of the price column in ranges, since the lowest price is 0 and the highest is 680; a second CASE statement ranks the groups in ascending order, from 0-100 up to the last group
- Looking at the distribution of the total reviews in ranges from 0 to 540, which is our highest value

We will visualize these queries in Power BI, and I will provide the Power BI visuals file and also the Power BI file.
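A few of the queries listed above could be written roughly as follows. This is a sketch, not the exact queries used in the project: it assumes a table named `books` in an in-memory SQLite database, the rows are hypothetical, and the price range boundaries are picked only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (title TEXT, type TEXT, rating REAL, "
    "total_reviews INTEGER, price REAL)"
)
# Hypothetical rows (made up for illustration)
rows = [
    ("Book A", "Paperback", 4.5, 120, 250),
    ("Book B", "Hardcover", 4.0, 30, 400),
    ("Book C", "Paperback", 3.8, 45, 90),
    ("Book D", "Kindle Edition", 4.9, 540, 0),
    ("Book E", "Hardcover", 4.2, 80, 680),
]
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?)", rows)

# Min and max of a column (the same pattern works for rating and reviews)
min_max_price = conn.execute(
    "SELECT MIN(price), MAX(price) FROM books"
).fetchone()

# Book types grouped and ordered by their total number of reviews
reviews_by_type = conn.execute("""
    SELECT type, COUNT(*) AS n_books, SUM(total_reviews) AS reviews
    FROM books
    GROUP BY type
    ORDER BY reviews DESC
""").fetchall()

# Price ranges: the first CASE labels each range, the second CASE gives
# each range a rank so the groups can be sorted in ascending order
price_ranges = conn.execute("""
    SELECT
        CASE
            WHEN price <= 100 THEN '0-100'
            WHEN price <= 300 THEN '101-300'
            WHEN price <= 500 THEN '301-500'
            ELSE '501-700'
        END AS price_range,
        CASE
            WHEN price <= 100 THEN 1
            WHEN price <= 300 THEN 2
            WHEN price <= 500 THEN 3
            ELSE 4
        END AS range_rank,
        COUNT(*) AS n_books
    FROM books
    GROUP BY price_range, range_rank
    ORDER BY range_rank
""").fetchall()

print(min_max_price)
print(reviews_by_type)
print(price_ranges)
```

The rank column is there because the range labels are text, and sorting text labels alphabetically does not always match their numeric order.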