BikeShare in SQL

Note: A more in-depth EDA was done using R and can be found at BikeShare in the R section of this site. Explanation of the data and case study can be found at the link as well.

BigQuery Connection

library(bigrquery)       # R Interface to Google BigQuery API  
library(dplyr)           # Grammar for data manipulation  
library(DBI) 
projectid <- 'prime-depot-415622'
datasetid <- 'BikeShare'
con <- dbConnect((bigquery()),
                 project = projectid,
                 dataset = datasetid,
                 use_legacy_sql = FALSE)

List Tables

  • Let’s see if city_data is the dataset we are looking for
  • Let’s list the tables in it
SELECT table_catalog, table_schema, table_name, table_type, is_insertable_into
FROM    BikeShare.INFORMATION_SCHEMA.TABLES
2 records
table_catalog table_schema table_name table_type is_insertable_into
prime-depot-415622 BikeShare trips19v1 BASE TABLE YES
prime-depot-415622 BikeShare trips20 BASE TABLE YES

List Columns

SELECT  column_name, ordinal_position, is_nullable, data_type 
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  table_name = 'trips19v1'
Displaying records 1 - 10
column_name ordinal_position is_nullable data_type
ride_id 1 YES INT64
started_at 2 YES TIMESTAMP
ended_at 3 YES TIMESTAMP
ride_length 4 YES STRING
rideable_type 5 YES INT64
tripduration 6 YES FLOAT64
start_station_id 7 YES INT64
start_station_name 8 YES STRING
end_station_id 9 YES INT64
end_station_name 10 YES STRING
SELECT  column_name, ordinal_position, is_nullable, data_type 
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  table_name = 'trips20'
Displaying records 1 - 10
column_name ordinal_position is_nullable data_type
ride_id 1 YES STRING
rideable_type 2 YES STRING
started_at 3 YES TIMESTAMP
ended_at 4 YES TIMESTAMP
start_station_name 5 YES STRING
start_station_id 6 YES INT64
end_station_name 7 YES STRING
end_station_id 8 YES INT64
start_lat 9 YES FLOAT64
start_lng 10 YES FLOAT64

EDA


Rides per User

SELECT
   subquery2.member_casual,
   COUNT(*) AS trip_count
FROM
  (
  SELECT
     subquery.member_casual,
     EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
     EXTRACT(MONTH FROM subquery.started_at) AS month,
     EXTRACT(YEAR FROM subquery.started_at) AS year,
     EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
     EXTRACT(DATE FROM subquery.started_at) AS date,
     TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
  FROM
     ( 
     SELECT
       CAST(ride_id AS STRING) AS ride_id, 
       started_at,
       ended_at,
       start_station_id, 
       start_station_name,
       end_station_name,
       end_station_id,
       member_casual
     FROM 
       trips19v1
     UNION ALL
     SELECT
       ride_id,
       started_at,
       ended_at,
       start_station_id,
       start_station_name,
       end_station_name,
       end_station_id,
       CASE
           WHEN  member_casual = 'member' THEN 'Subscriber'
           WHEN  member_casual = 'casual' THEN 'Customer'
           ELSE  member_casual
       END
     FROM trips20
     ) AS subquery
   WHERE 
     (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
     AND
     TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
   ) AS subquery2
GROUP BY
subquery2.member_casual;
2 records
member_casual trip_count
Subscriber 720311
Customer 67877

Group Rides

  • Group the rides by user
  • Separate by day of the week

NOTE: this chunk is the same as the one right above it, with the exception of the last few lines I marked, which were added to group the results

SELECT
   subquery2.member_casual,
   subquery2.day_of_week,
   COUNT(*) AS trip_count
FROM
  (
  SELECT
     subquery.member_casual,
     EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
     EXTRACT(MONTH FROM subquery.started_at) AS month,
     EXTRACT(YEAR FROM subquery.started_at) AS year,
     EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
     EXTRACT(DATE FROM subquery.started_at) AS date,
     TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
  FROM
     ( 
     SELECT
       CAST(ride_id AS STRING) AS ride_id, 
       started_at,
       ended_at,
       start_station_id, 
       start_station_name,
       end_station_name,
       end_station_id,
       member_casual
     FROM trips19v1
     UNION ALL
     SELECT
       ride_id,
       started_at,
       ended_at,
       start_station_id,
       start_station_name,
       end_station_name,
       end_station_id,
       CASE
         WHEN  member_casual = 'member' THEN 'Subscriber'
         WHEN  member_casual = 'casual' THEN 'Customer'
         ELSE  member_casual
       END
     FROM trips20
     ) AS subquery
   WHERE 
     (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
     AND
     TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
   ) AS subquery2

GROUP BY
subquery2.member_casual,
# ----- The next few lines are the only lines that were added to the above chunk of code
subquery2.day_of_week

ORDER BY
subquery2.day_of_week,
subquery2.member_casual;
Displaying records 1 - 10
member_casual day_of_week trip_count
Customer 1 18652
Subscriber 1 60196
Customer 2 5591
Subscriber 2 110430
Customer 3 7311
Subscriber 3 127974
Customer 4 7690
Subscriber 4 121902
Customer 5 7147
Subscriber 5 125228

Breakdown by Quarter


SELECT
   subquery2.quarter,
   subquery2.member_casual,
   subquery2.day_of_week,
   COUNT(*) AS trip_count
FROM
  (
  SELECT
     subquery.member_casual,
     EXTRACT(DAYOFWEEK FROM subquery.started_at) AS day_of_week,
     EXTRACT(MONTH FROM subquery.started_at) AS month,
     EXTRACT(YEAR FROM subquery.started_at) AS year,
     EXTRACT(QUARTER FROM subquery.started_at) AS quarter,
     EXTRACT(DATE FROM subquery.started_at) AS date,
     TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) AS ride_length
  FROM
     ( 
     SELECT
       CAST(ride_id AS STRING) AS ride_id, 
       started_at,
       ended_at,
       start_station_id, 
       start_station_name,
       end_station_name,
       end_station_id,
       member_casual
     FROM trips19v1
     UNION ALL
     SELECT
       ride_id,
       started_at,
       ended_at,
       start_station_id,
       start_station_name,
       end_station_name, 
       end_station_id,
       CASE
         WHEN  member_casual = 'member' THEN 'Subscriber'
         WHEN  member_casual = 'casual' THEN 'Customer'
         ELSE  member_casual
       END
     FROM trips20
     ) AS subquery
   WHERE 
     (start_station_name <> "HQ QR" AND end_station_name <> "HQ QR")
     AND
     TIMESTAMP_DIFF(subquery.ended_at, subquery.started_at, SECOND) > 0
   ) AS subquery2

GROUP BY
subquery2.quarter,
subquery2.member_casual,
subquery2.day_of_week

ORDER BY
subquery2.quarter,
subquery2.day_of_week,
subquery2.member_casual;
Displaying records 1 - 10
quarter member_casual day_of_week trip_count
1 Customer 1 18652
1 Subscriber 1 60196
1 Customer 2 5591
1 Subscriber 2 110430
1 Customer 3 7311
1 Subscriber 3 127974
1 Customer 4 7690
1 Subscriber 4 121902
1 Customer 5 7147
1 Subscriber 5 125228

Same Station Rides

# ________ We can use the code above and substitute the 
WHERE     
        subquery2.start_station_id = subquery2.start_station_id
        
        
OUTPUT
23626

Observations

  • If we group trip_count and sort it in order, a glaring figure jumps out at us:
  • the top 3 stations in rides (by 50% more than the fourth place are in this order: Station # 192, 91, 77
1 Subscriber 192 13799
2 Subscriber 91 13434
3 Subscriber 77 12891
4 Subscriber 133 8720
  • If we add the other usertype to the data and recalculate we get:
1 192 14155
2 91 1640
3 77 13362
  • As you see the Customer users account for:
    • 356 at station 192 out of the total of 14155 trips
    • 206 at station 91 out of the total of 13640 trips
    • 471 at station 133 out of the total of 13362

TABLEAU


Total Rides per Day

Rides per Day/Year

  • Let’s see what has happened from year to year for the same quarter Q1.
  • It appears that the number of rides have gone up for all days EXCEPT Thursday and Fridays, where we see the number of rides have  decreased

Rides per Year

Location of Top 5

  • Let’s see if the busiest 5 stations are located near a landmark that can help us improve our service

Top 5 Starting Stations

  • Let’s identify the top 5 performing starting stations
  • Maybe the locations can help us understand our users

Interactive Top 5 Graphs

  • I’ve gathered multiple top 5 charts together to help me speed up the understanding of the usage
  • Just click on the tabs to change views

Top 5 Ending Stations

  • I had identified the top 5 starting stations how about we look at the top 5 ending stations
  • Maybe this will help us direct our bike availablity or
  • Maybe we can better understand our user usage
  • Once again I gathered multiple charts together for easier exploration

Top 5 Start=End

  • Let’s see if there is a pattern between starting and ending stations
  • Let’s look how often our user return the bikes to the start station

Observations


  • The company claims that “Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day”, how is that numbered calculated?
  • The two plans offered to casual riders: single-ride passes and full-day passes are not tracked in the data provided, all we have is user type: “Customer” or “Subscriber”
  • The data provided doesn’t link any of the rides to an actual account. Why don’t we track by userId? Don’t Subscribers have to create an account? Why are we just sorting the data by “Member Casual”.
  • We can gather more insight if the rideId is linked to a userId, is the data available but not provided for analysis?
  • How are docking station locations related to subway, train, or bus stations?
  • Are docking stations near parking structures or major tourist, business centers?
  • Are users coming to the city via other mode of transportation and using bikes within the city for commuting?
  • The objective of the project is to find ways to convert “Customers” to “Subscribers” which is logical except the data provided doesn’t support the theory proposed by Mrs. Moreno. The data shows a large discrepancy between the two users, as Customers account to 9% of Subscribers usage. See table below.
  • The data does not show how many actual users are represented by each type. It is illogical to use rideId as userId. The data provided does not promote an insightful analysis regarding the particular hypothesis presented.
  • The idea that converting 9% of rides will improve the bottom line, sure but at what cost? How many users are we converting? How many Subscribers do we already have?
  • Considering the fact that I am not provided with data relevant to the proposed hypothesis, I would shift my focus on other issues that the data has exposed.
  • The facts that weekday ride numbers have been on the decrease for the busiest stations is alarming (see the last table below). The top stations are very close in proximity to one another, so it is possible that users are not getting good service or possibly a competitor has entered the arena and is targeting that specific small area where the top stations are located. Maybe inventory at those stations doesn’t support the volume of rides initiated from there?
  •  The fact that inventory at those stations is not tracked needs to be addressed.
  • The top stations are far more important to the bottom line than wasting resources on the hypothesis that has been proposed, with the data provided. We cannot worry about converting 9% of rides while the top 3 stations are losing rides at a higher pace and by larger numbers than 9%.

Recommendations


  • Modify the data we track.
  • Implement userId so we can focus our analysis and be more insightful.
  • Focus on UX at the top stations.
  • Stabilize ride numbers at the top stations.
  • Reassess the situation in the near future after modifications have been implemented.