The purpose of this document is to consolidate downloaded Divvy data into a single dataframe and then conduct simple analysis to help answer the key question: “In what ways do members and casual riders use Divvy bikes differently?”
This exploratory analysis case study is towards Capstome project requirement for [Google Data Analytics Professional Certificate]. The case study involves a bikeshare company's data of its customer's trip details over a 12 month period for 2020.
The analysis will follow the 6 phases of the Data Analysis process: Ask, Prepare, Process, Analyze, and Act. A brief explanation of these processes:
- Ask effective questions
- Define the scope of the analysis
- Define what success looks like
- Verify data’s integrity
- Check data credibility and reliability
- Check data types
- Merge datasets
- Clean, Remove and Transform data
- Document cleaning processes and results
- Identify patterns
- Draw conclusions
- Make predictions
- Create effective visuals
- Create a story for data
- Share insights to stakeholders
- Give recommendations based on insights
- Solve problems
- Create something new
Marketing team needs to design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ.
- Director of marketing
- Cyclistic executive team
Hence, the objective for this analysis is to throw some light on how the two types of customers: annual members and casual riders, use Cyclistic bikeshare differently, based on few parameters that can be calculated/ obtained from existing data.
- Insights on how annual members and casual riders use Cyclistic bikes differently
- Provide effective visuals and relevant data to support insights
- Use insights to give three recommendations to convert casual riders to member riders
A total of 12 CSV files have been made available for each month starting from January 2020 to December 2020. Each file captures the details of every ride logged by the customers of Cyclistic. This data that has been made publicly available has been scrubbed to omit rider's personal information.
The combined size of all the 12 CSV files is close to 950 MB. Data cleaning in spreadsheets will be time-consuming and slow compared to R. I am choosing R simply because I could do both data wrangling and analysis/ visualizations in the same platform.
library(tidyverse)
library(ggplot2)
library(lubridate)
library(dplyr)
library(readr)
library(janitor)
library(data.table)
library(tidyr)
tripdata_Q1_2020 <- read.csv("C:/Users/PSagar/Downloads/Divvy_trip_data_FY2020/Divvy_Trips_2020_Q1.csv")
tripdata_202004 <- read.csv("C:/Users/PSagar/Downloads/Divvy_trip_data_FY2020/202004-divvy-tripdata.csv")
tripdata_202005 <- read.csv("C:/Users/PSagar/Downloads/Divvy_trip_data_FY2020/202005-divvy-tripdata.csv")
tripdata_202006 <- read.csv("C:/Users/PSagar/Downloads/Divvy_trip_data_FY2020/202006-divvy-tripdata.csv")
tripdata_202007 <- read.csv("C:/Users/PSagar/Downloads/Divvy_trip_data_FY2020/202007-divvy-tripdata.csv")
tripdata_202008 <- read.csv("C:/Users/PSagar/Downloads/Divvy_trip_data_FY2020/202008-divvy-tripdata.csv")
tripdata_202009 <- read.csv("C:/Users/PSagar/Downloads/Divvy_trip_data_FY2020/202009-divvy-tripdata.csv")
tripdata_202010 <- read.csv("C:/Users/PSagar/Downloads/Divvy_trip_data_FY2020/202010-divvy-tripdata.csv")
tripdata_202011 <- read.csv("C:/Users/PSagar/Downloads/Divvy_trip_data_FY2020/202011-divvy-tripdata.csv")
tripdata_202012 <- read.csv("C:/Users/PSagar/Downloads/Divvy_trip_data_FY2020/202012-divvy-tripdata.csv")
colnames(tripdata_Q1_2020)
colnames(tripdata_202004)
colnames(tripdata_202005)
colnames(tripdata_202006)
colnames(tripdata_202007)
colnames(tripdata_202008)
colnames(tripdata_202009)
colnames(tripdata_202010)
colnames(tripdata_202011)
colnames(tripdata_202012)
str(tripdata_202012)
start_station_id & end_station_id are not consistent in tripdata_202012.csv. Convert the inconsistent ones from char to int datatype since all other dataframe are in this format.
tripdata_202012 <- mutate(tripdata_202012, start_station_id = as.integer(start_station_id), end_station_id = as.integer(end_station_id))
all_trips <- bind_rows(tripdata_Q1_2020, tripdata_202004, tripdata_202005, tripdata_202006, tripdata_202007, tripdata_202008, tripdata_202009, tripdata_202010, tripdata_202011, tripdata_202012)
all_trips <- all_trips %>%
rename(trip_id = ride_id
, bike_id = rideable_type
, start_time = started_at
, end_time = ended_at
, from_station_name = start_station_name
, from_station_id = start_station_id
, to_station_name = end_station_name
, to_station_id = end_station_id
, usertype = member_casual
)
str(all_trips)
all_trips <- all_trips %>%
select(-c(start_lat, start_lng, end_lat, end_lng))
colnames(all_trips)
nrow(all_trips)
dim(all_trips)
head(all_trips)
str(all_trips)
summary(all_trips)
table(all_trips$usertype)
all_trips$date <- as.Date(all_trips$start_time)
all_trips$day <- format(as.Date(all_trips$date),"%d")
all_trips$month <- format(as.Date(all_trips$date),"%m")
all_trips$year <- format(as.Date(all_trips$date),"%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date),"%A")
all_trips$ride_duration <- difftime(all_trips$end_time,all_trips$start_time)
str(all_trips)
is.factor(all_trips$ride_duration)
all_trips$ride_duration <- as.numeric(as.character(all_trips$ride_duration))
is.numeric(all_trips$ride_duration)
all_trips_v2 <- all_trips[!(all_trips$from_station_name == "HQ QR" | all_trips$ride_duration<0),]
mean(all_trips_v2$ride_duration)
median(all_trips_v2$ride_duration)
max(all_trips_v2$ride_duration)
min(all_trips_v2$ride_duration)
summary(all_trips_v2$ride_duration)
aggregate(all_trips_v2$ride_duration ~ all_trips_v2$usertype, FUN = mean)
aggregate(all_trips_v2$ride_duration ~ all_trips_v2$usertype, FUN = median)
aggregate(all_trips_v2$ride_duration ~ all_trips_v2$usertype, FUN = max)
aggregate(all_trips_v2$ride_duration ~ all_trips_v2$usertype, FUN = min)
#see the average ride duration by each day for members vs casual users
aggregate(all_trips_v2$ride_duration ~ all_trips_v2$usertype + all_trips_v2$day_of_week, FUN = mean)
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday","Monday","Tuesday","Wednesday","Thrusday","Friday","Saturday"))
aggregate(all_trips_v2$ride_duration ~ all_trips_v2$usertype + all_trips_v2$day_of_week, FUN = mean)
all_trips_v2 %>%
mutate(weekday = wday(start_time,label = TRUE)) %>% #creates weekday field using wday func
group_by(usertype,weekday) %>% #groups by usertype and weekday
summarise(number_of_rides = n(),average_duration = mean(ride_duration)) %>% #calculates the number of rides and average duration
arrange(usertype,weekday) #sorts usertype and weekday
all_trips_v2 %>%
mutate(weekday = wday(start_time, label = TRUE)) %>%
group_by(usertype, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_duration)) %>%
arrange(usertype, weekday) %>%
ggplot(aes(x=weekday,y=number_of_rides,fill=usertype)) + geom_col(position = "dodge")
all_trips_v2 %>%
mutate(weekday = wday(start_time, label = TRUE)) %>%
group_by(usertype, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_duration)) %>%
arrange(usertype, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = usertype)) + geom_col(position = "dodge")
-
Casual riders made 41% of total trips contributing to 66% of total trip duration between Jan'20 - Dec'20. Member riders make up 59% of total trips contributing to 34% of total trip duration between Jan'20 - Dec'20
-
Usage (based on trip duration) of bikes by casual riders is almost twice that of member riders.
-
Casual customers use bikeshare services more during weekends, while members use them consistently over the entire week.
-
Average trip duration of casual riders is more than twice that of member rider over any given day of the week cumulatively.
-
Casual riders ride longer during first half of the year compared to the second half, while members clock relatively similar average trip duration month over month.
-
Casual riders prefer electric bikes the most while classic bikes are popular among members.
-
Provide attractive promotions for casual riders on weekdays so that casual members use the bikeshare services more uniformly across the entire week.
-
Offer discounted membership fee for renewals after the first year. It might attract casual riders to take up membership.
-
Offer discounted pricing during non-busy hours so that casual riders might choose to use bikes more often and level out demand over the day.