Cyclistic is a successful bike-share offering company. It boasts a fleet of 5,824 bicycles that are geo-tracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system any time during the day. Cyclistic’s marketing strategy relies on building general awareness and appealing to broad consumer segments.One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
The data that we are looking has been collected by the geo-trackers equipped on each of the bikes. Each trip that the bikes took have been compiled, containing notable measurements such as start time, end time, start station, end station, and the user type (Member or Casual).
##The Business (T)ASK
How do annual members and casual riders use Cyclistic bikes differently? The business task at hand is to answer the above question with the objective of increasing membership in mind. Analyzing the data will and answering this question can help the company make business decisions that will positively impact the company’s revenue.
To specifically answer the question, we will need to inspect and prepare the data to help us accurately find the answers.
We’ll use a few handy libraries to assist us with data wrangling and cleaning.
library(tidyverse) #helps wrangle data
library(lubridate) #helps wrangle date attributes
library(ggplot2) #helps visualize data
This is the most recent available data from 2019 Q2 to 2020 Q1.
setwd('C:/Users/Davey Wong/Desktop/Divvy_Data')
q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")
q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")
#Taking a quick look at the column names from these 4 separate files, we can easily see that that column names do no match, which will make it difficult to join them together.
colnames(q3_2019)
colnames(q4_2019)
colnames(q2_2019)
colnames(q1_2020)
Using the rename function, we can rewrite the column names so that they are exact matches with the other files that we are working with. This will allow us to easily combine the data sets. Using the mutate function we can convert certain columns to the correct data types so that they will not be named the same thing but separated by data type. (code not shown for brevity)
We now combine the 4 csv files into one dataframe that we can easily work with all the data at once in.
all_trips <- bind_rows(q2_2019, q3_2019, q4_2019, q1_2020)
#We don't need these columns of data for our analysis, so we are simply removing them
all_trips <- all_trips %>%
select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender, "01 - Rental Details Duration In Seconds Uncapped", "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))
Let’s take a look and see how our data looks.
## ride_id started_at
## Length:3879822 Min. :2019-04-01 00:02:22.00
## Class :character 1st Qu.:2019-06-23 07:49:09.25
## Mode :character Median :2019-08-14 17:43:38.00
## Mean :2019-08-26 00:49:59.38
## 3rd Qu.:2019-10-12 12:10:21.00
## Max. :2020-03-31 23:51:34.00
##
## ended_at rideable_type start_station_id
## Min. :2019-04-01 00:09:48.00 Length:3879822 Min. : 1.0
## 1st Qu.:2019-06-23 08:20:27.75 Class :character 1st Qu.: 77.0
## Median :2019-08-14 18:02:04.00 Mode :character Median :174.0
## Mean :2019-08-26 01:14:37.06 Mean :202.9
## 3rd Qu.:2019-10-12 12:36:16.75 3rd Qu.:291.0
## Max. :2020-05-19 20:10:34.00 Max. :675.0
##
## start_station_name end_station_id end_station_name member_casual
## Length:3879822 Min. : 1.0 Length:3879822 Length:3879822
## Class :character 1st Qu.: 77.0 Class :character Class :character
## Mode :character Median :174.0 Mode :character Mode :character
## Mean :203.8
## 3rd Qu.:291.0
## Max. :675.0
## NA's :1
A few things that we notice from looking at our data set as a whole: 1.In the “member_casual” column, there are two names for members (“member” and “Subscriber”) and two names for casual riders (“Customer” and “casual”). We will need to consolidate the labels so that there are only two labels as opposed to four.
We will want to add some additional columns of data – such as day, month, year – that provide additional opportunities to aggregate the data.
We will want to add a calculated field for length of ride called “ride_length”.
We will want to delete the rides where tripduration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons.
#Here we are reassigning any values in the member_casual column ("Subscriber" becomes "member", "Customer" becomes "Casual")
all_trips <- all_trips %>%
mutate(member_casual = recode(member_casual
,"Subscriber" = "member"
,"Customer" = "casual"))
#Creating new columns that split up the dates by their individual parts (month, day, year, day_of_week)
all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
#Creating a calculated column called ride_length, which will be the duration of each bike trip in seconds
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
#Create v2 of our data set to exclude the entries with a negative ride_length value
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
Let’s take a look at our cleaned up data set
## # A tibble: 6 × 15
## ride_id started_at ended_at rideable_type start…¹ start…²
## <chr> <dttm> <dttm> <chr> <dbl> <chr>
## 1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48 6251 81 Daley …
## 2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30 6226 317 Wood S…
## 3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19 5649 283 LaSall…
## 4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58 4151 26 McClur…
## 5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13 3270 202 Halste…
## 6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56 3123 420 Ellis …
## # … with 9 more variables: end_station_id <dbl>, end_station_name <chr>,
## # member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,
## # day_of_week <chr>, ride_length <dbl>, and abbreviated variable names
## # ¹start_station_id, ²start_station_name
## # ℹ Use `colnames()` to see all variable names
Now that our data set is cleaned and ready to use, we can perform some initial descriptive analysis. Our mean ride length is 1479.14 seconds. Our max ride length is 9387024.00 seconds, while our min ride length is 1 second.
Let’s take a look at some more specific metrics comparing the member riders to the casual riders.
#Aggregate the data based on average ride length for members versus casuals
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 3552.7502
## 2 member 850.0662
#Rearrange data so that the days of the week are in order
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
#Aggregate the data based on average ride length for members versus casuals
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1 casual Sunday 3581.4054
## 2 member Sunday 919.9746
## 3 casual Monday 3372.2869
## 4 member Monday 842.5726
## 5 casual Tuesday 3596.3599
## 6 member Tuesday 826.1427
## 7 casual Wednesday 3718.6619
## 8 member Wednesday 823.9996
## 9 casual Thursday 3682.9847
## 10 member Thursday 823.9278
## 11 casual Friday 3773.8351
## 12 member Friday 824.5305
## 13 casual Saturday 3331.9138
## 14 member Saturday 968.9337
Let’s investigate a bit further.
#This will allow us to see a side by side comparison of casual vs. member riders and their weekdays of use and also their average ride duration
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, weekday) %>% #groups by 'usertype' and 'weekday'
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, weekday)
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual weekday number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sun 181293 3581.
## 2 casual Mon 103296 3372.
## 3 casual Tue 90510 3596.
## 4 casual Wed 92457 3719.
## 5 casual Thu 102679 3683.
## 6 casual Fri 122404 3774.
## 7 casual Sat 209543 3332.
## 8 member Sun 267965 920.
## 9 member Mon 472196 843.
## 10 member Tue 508445 826.
## 11 member Wed 500329 824.
## 12 member Thu 484177 824.
## 13 member Fri 452790 825.
## 14 member Sat 287958 969.
We can see some clear differences between the days of use and the length of use between casual and member riders. Let’s visualize those differences.
From these visualizations we can see that there is a stark contrast between the casual riders and the member riders. Notably, the casual riders have a higher volume of rides on the weekends where as the member riders typically ride the most during the weekday; this would make the most sense because members tend to be work commuters. The second plot displays the major difference in average bike ride duration. Casual members seem to ride more leisurely, hence the weekend-heavy volume and the longer-duration rides.
To attract casual riders into becoming members, there must be a pricing adjustment targeted for those that ride less days a week but for a longer duration. This pricing must be favorable to the day pass pricing, which likely is more cost-efficient for the casual members biking only twice a week on the weekends.
We can export the data for further analysis to confirm.
rides <- all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
write.csv(rides, file='ride.csv')