The Company

Cyclistic is a successful bike-share offering company. It boasts a fleet of 5,824 bicycles that are geo-tracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system any time during the day. Cyclistic’s marketing strategy relies on building general awareness and appealing to broad consumer segments.One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

The Data

The data that we are looking has been collected by the geo-trackers equipped on each of the bikes. Each trip that the bikes took have been compiled, containing notable measurements such as start time, end time, start station, end station, and the user type (Member or Casual).

##The Business (T)ASK

How do annual members and casual riders use Cyclistic bikes differently? The business task at hand is to answer the above question with the objective of increasing membership in mind. Analyzing the data will and answering this question can help the company make business decisions that will positively impact the company’s revenue.

Data Wrangling and Cleaning

To specifically answer the question, we will need to inspect and prepare the data to help us accurately find the answers.

We’ll use a few handy libraries to assist us with data wrangling and cleaning.

library(tidyverse)  #helps wrangle data
library(lubridate)  #helps wrangle date attributes
library(ggplot2)  #helps visualize data

Reading in the Data with read.csv

This is the most recent available data from 2019 Q2 to 2020 Q1.

setwd('C:/Users/Davey Wong/Desktop/Divvy_Data')
q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")
q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")
#Taking a quick look at the column names from these 4 separate files, we can easily see that that column names do no match, which will make it difficult to join them together. 
colnames(q3_2019)
colnames(q4_2019)
colnames(q2_2019)
colnames(q1_2020)

Data Wrangling

Using the rename function, we can rewrite the column names so that they are exact matches with the other files that we are working with. This will allow us to easily combine the data sets. Using the mutate function we can convert certain columns to the correct data types so that they will not be named the same thing but separated by data type. (code not shown for brevity)

Creating one big dataset

We now combine the 4 csv files into one dataframe that we can easily work with all the data at once in.

all_trips <- bind_rows(q2_2019, q3_2019, q4_2019, q1_2020) 

#We don't need these columns of data for our analysis, so we are simply removing them 
all_trips <- all_trips %>%  
  select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender, "01 - Rental Details Duration In Seconds Uncapped", "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))

Data Cleaning

Let’s take a look and see how our data looks.

##    ride_id            started_at                    
##  Length:3879822     Min.   :2019-04-01 00:02:22.00  
##  Class :character   1st Qu.:2019-06-23 07:49:09.25  
##  Mode  :character   Median :2019-08-14 17:43:38.00  
##                     Mean   :2019-08-26 00:49:59.38  
##                     3rd Qu.:2019-10-12 12:10:21.00  
##                     Max.   :2020-03-31 23:51:34.00  
##                                                     
##     ended_at                      rideable_type      start_station_id
##  Min.   :2019-04-01 00:09:48.00   Length:3879822     Min.   :  1.0   
##  1st Qu.:2019-06-23 08:20:27.75   Class :character   1st Qu.: 77.0   
##  Median :2019-08-14 18:02:04.00   Mode  :character   Median :174.0   
##  Mean   :2019-08-26 01:14:37.06                      Mean   :202.9   
##  3rd Qu.:2019-10-12 12:36:16.75                      3rd Qu.:291.0   
##  Max.   :2020-05-19 20:10:34.00                      Max.   :675.0   
##                                                                      
##  start_station_name end_station_id  end_station_name   member_casual     
##  Length:3879822     Min.   :  1.0   Length:3879822     Length:3879822    
##  Class :character   1st Qu.: 77.0   Class :character   Class :character  
##  Mode  :character   Median :174.0   Mode  :character   Mode  :character  
##                     Mean   :203.8                                        
##                     3rd Qu.:291.0                                        
##                     Max.   :675.0                                        
##                     NA's   :1

A few things that we notice from looking at our data set as a whole: 1.In the “member_casual” column, there are two names for members (“member” and “Subscriber”) and two names for casual riders (“Customer” and “casual”). We will need to consolidate the labels so that there are only two labels as opposed to four.

  1. We will want to add some additional columns of data – such as day, month, year – that provide additional opportunities to aggregate the data.

  2. We will want to add a calculated field for length of ride called “ride_length”.

  3. We will want to delete the rides where tripduration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons.

#Here we are reassigning any values in the member_casual column ("Subscriber" becomes "member", "Customer" becomes "Casual")
all_trips <-  all_trips %>% 
  mutate(member_casual = recode(member_casual
                           ,"Subscriber" = "member"
                           ,"Customer" = "casual"))

#Creating new columns that split up the dates by their individual parts (month, day, year, day_of_week)
all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

#Creating a calculated column called ride_length, which will be the duration of each bike trip in seconds
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))

#Create v2 of our data set to exclude the entries with a negative ride_length value
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]

Let’s take a look at our cleaned up data set

## # A tibble: 6 × 15
##   ride_id  started_at          ended_at            rideable_type start…¹ start…²
##   <chr>    <dttm>              <dttm>              <chr>           <dbl> <chr>  
## 1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48 6251               81 Daley …
## 2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30 6226              317 Wood S…
## 3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19 5649              283 LaSall…
## 4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58 4151               26 McClur…
## 5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13 3270              202 Halste…
## 6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56 3123              420 Ellis …
## # … with 9 more variables: end_station_id <dbl>, end_station_name <chr>,
## #   member_casual <chr>, date <date>, month <chr>, day <chr>, year <chr>,
## #   day_of_week <chr>, ride_length <dbl>, and abbreviated variable names
## #   ¹​start_station_id, ²​start_station_name
## # ℹ Use `colnames()` to see all variable names

Conducting Descriptive Analysis

Now that our data set is cleaned and ready to use, we can perform some initial descriptive analysis. Our mean ride length is 1479.14 seconds. Our max ride length is 9387024.00 seconds, while our min ride length is 1 second.

Let’s take a look at some more specific metrics comparing the member riders to the casual riders.

#Aggregate the data based on average ride length for members versus casuals
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                3552.7502
## 2                     member                 850.0662
#Rearrange data so that the days of the week are in order
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

#Aggregate the data based on average ride length for members versus casuals
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
##    all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                      casual                   Sunday                3581.4054
## 2                      member                   Sunday                 919.9746
## 3                      casual                   Monday                3372.2869
## 4                      member                   Monday                 842.5726
## 5                      casual                  Tuesday                3596.3599
## 6                      member                  Tuesday                 826.1427
## 7                      casual                Wednesday                3718.6619
## 8                      member                Wednesday                 823.9996
## 9                      casual                 Thursday                3682.9847
## 10                     member                 Thursday                 823.9278
## 11                     casual                   Friday                3773.8351
## 12                     member                   Friday                 824.5305
## 13                     casual                 Saturday                3331.9138
## 14                     member                 Saturday                 968.9337

Let’s investigate a bit further.

#This will allow us to see a side by side comparison of casual vs. member riders and their weekdays of use and also their average ride duration 
all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%  #creates weekday field using wday()
  group_by(member_casual, weekday) %>%  #groups by 'usertype' and 'weekday'
  summarise(number_of_rides = n()                           #calculates the number of rides and average duration 
  ,average_duration = mean(ride_length)) %>%        # calculates the average duration
  arrange(member_casual, weekday)       
## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual weekday number_of_rides average_duration
##    <chr>         <ord>             <int>            <dbl>
##  1 casual        Sun              181293            3581.
##  2 casual        Mon              103296            3372.
##  3 casual        Tue               90510            3596.
##  4 casual        Wed               92457            3719.
##  5 casual        Thu              102679            3683.
##  6 casual        Fri              122404            3774.
##  7 casual        Sat              209543            3332.
##  8 member        Sun              267965             920.
##  9 member        Mon              472196             843.
## 10 member        Tue              508445             826.
## 11 member        Wed              500329             824.
## 12 member        Thu              484177             824.
## 13 member        Fri              452790             825.
## 14 member        Sat              287958             969.

We can see some clear differences between the days of use and the length of use between casual and member riders. Let’s visualize those differences.

Conclusions

From these visualizations we can see that there is a stark contrast between the casual riders and the member riders. Notably, the casual riders have a higher volume of rides on the weekends where as the member riders typically ride the most during the weekday; this would make the most sense because members tend to be work commuters. The second plot displays the major difference in average bike ride duration. Casual members seem to ride more leisurely, hence the weekend-heavy volume and the longer-duration rides.

Business Insight

To attract casual riders into becoming members, there must be a pricing adjustment targeted for those that ride less days a week but for a longer duration. This pricing must be favorable to the day pass pricing, which likely is more cost-efficient for the casual members biking only twice a week on the weekends.

Further Analysis

We can export the data for further analysis to confirm.

rides <- all_trips_v2 %>% 
      mutate(weekday = wday(started_at, label = TRUE)) %>%  
      group_by(member_casual, weekday) %>% 
      summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%         
      arrange(member_casual, weekday)


write.csv(rides, file='ride.csv')