NY Bikeshare - BigQuery

I am using data stored in BigQuery to demonstrate how to connect to BigQuery and run either R or SQL queries from RStudio.

BigQuery Connection

library(bigrquery)       # R Interface to Google BigQuery API  
library(dplyr)           # Grammar for data manipulation  
library(DBI) 
projectid <- 'prime-depot-415622'
datasetid <- 'city_data'
con <- dbConnect((bigquery()),
                 project = projectid,
                 dataset = datasetid,
                 use_legacy_sql = FALSE)

List Datasets

NOTE: Since we are using Rstudio to connect to BigQuery and we are executing the queries in separate chunks - EACH chunk querying BigQuery need to have connection declaration in it: #| connection = con We use con because we already defined it as such in the first code chunk

SELECT  *  
        EXCEPT(schema_owner, location, DDL, default_collation_name)   
FROM    
        INFORMATION_SCHEMA.SCHEMATA

Displaying records 1 - 10
catalog_name	schema_name	creation_time	last_modified_time	sync_status
prime-depot-415622	cus_data	2024-03-03 21:24:36	2024-03-03 21:24:36	NULL
prime-depot-415622	ny_citibikes	2024-08-25 16:35:40	2024-08-25 16:35:40	NULL
prime-depot-415622	bellabeat	2024-03-20 20:06:32	2024-03-20 20:06:32	NULL
prime-depot-415622	employee_data	2024-03-06 23:59:41	2024-03-06 23:59:41	NULL
prime-depot-415622	demos	2024-03-05 17:43:26	2024-03-05 17:43:26	NULL
prime-depot-415622	cars	2024-03-03 18:17:34	2024-03-03 18:17:34	NULL
prime-depot-415622	babynames	2024-02-28 17:06:55	2024-02-28 17:06:55	NULL
prime-depot-415622	BikeShare	2024-04-01 20:47:30	2024-04-01 20:47:30	NULL
prime-depot-415622	city_data	2024-02-28 16:34:15	2024-02-28 16:34:15	NULL
prime-depot-415622	customer_data	2024-03-03 16:27:32	2024-03-03 16:27:32	NULL

List Tables

Let’s see if city_data is the dataset we are looking for
Let’s list the tables in it

SELECT table_catalog, table_schema, table_name, table_type, is_insertable_into
FROM    city_data.INFORMATION_SCHEMA.TABLES

1 records
table_catalog	table_schema	table_name	table_type	is_insertable_into
prime-depot-415622	city_data	cities	BASE TABLE	YES

List Column Schema

Well that’s not it

SELECT table_name, column_name, ordinal_position, is_nullable, data_type 
FROM    city_data.INFORMATION_SCHEMA.COLUMNS

4 records
table_name	column_name	ordinal_position	is_nullable	data_type
cities	city_name	1	YES	STRING
cities	avg_temp	2	YES	INT64
cities	avg_commute	3	YES	INT64
cities	happiness_ranking	4	YES	INT64

BigQuery Connection

Well, as you saw above that wasn’t the correct dataset, so let’s setup a new connection and we’ll call it
con_ny

projectid <- 'prime-depot-415622'
datasetid <- 'ny_citibikes'
con_ny <- dbConnect((bigquery()),
                 project = projectid,
                 dataset = datasetid,
                 use_legacy_sql = FALSE)

List Tables

There is no need to list the datasets since we specified that in the connection

Reminder: We need to include #| connection = con_ny the name of the connection we established above in every code chunk we want to execute.

SELECT table_catalog, table_schema, table_name, table_type, is_insertable_into
FROM   ny_citibikes.INFORMATION_SCHEMA.TABLES

2 records
table_catalog	table_schema	table_name	table_type	is_insertable_into
prime-depot-415622	ny_citibikes	trips	BASE TABLE	YES
prime-depot-415622	ny_citibikes	stations	BASE TABLE	YES

List Column Schema

SELECT  column_name, ordinal_position, is_nullable, data_type 
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  table_name = 'stations'

Displaying records 1 - 10
column_name	ordinal_position	is_nullable	data_type
station_id	1	NO	STRING
name	2	YES	STRING
short_name	3	YES	STRING
latitude	4	YES	FLOAT64
longitude	5	YES	FLOAT64
region_id	6	YES	INT64
rental_methods	7	YES	STRING
capacity	8	YES	INT64
eightd_has_key_dispenser	9	YES	BOOL
num_bikes_available	10	YES	INT64

SELECT  column_name, ordinal_position, is_nullable, data_type 
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  table_name = 'trips'

Displaying records 1 - 10
column_name	ordinal_position	is_nullable	data_type
tripduration	1	YES	INT64
starttime	2	YES	DATETIME
stoptime	3	YES	DATETIME
start_station_id	4	YES	INT64
start_station_name	5	YES	STRING
start_station_latitude	6	YES	FLOAT64
start_station_longitude	7	YES	FLOAT64
end_station_id	8	YES	INT64
end_station_name	9	YES	STRING
end_station_latitude	10	YES	FLOAT64

EDA

Compare # Bikes

Avg # Bikes

Let’s start by calculating the avg number of bikes at each station
Avg = 14 bikes/station

SELECT
    AVG(num_bikes_available)
FROM
    ny_citibikes.stations 
 AS 
    avg_num_bikes_avail

1 records
f0_
14.04139

Compare Station/Avg

Let’s compare the average number per station to the average for all stations
Use the query above as a subquery
Remember the inner query is ALWAYS performed first

SELECT
        station_id,
        num_bikes_available,
        (SELECT
            AVG(num_bikes_available)
        FROM
            ny_citibikes.stations 
         AS 
            avg_num_bikes_avail)
FROM
        ny_citibikes.stations

Displaying records 1 - 10
station_id	num_bikes_available	f0_
66dce9b5-0aca-11e7-82f6-3863bb44ef7c	0	14.04139
66dd4ac1-0aca-11e7-82f6-3863bb44ef7c	0	14.04139
66dbcdfc-0aca-11e7-82f6-3863bb44ef7c	0	14.04139
66dd4b91-0aca-11e7-82f6-3863bb44ef7c	0	14.04139
1791472201147965920	0	14.04139
66dca5f3-0aca-11e7-82f6-3863bb44ef7c	0	14.04139
70634bc4-4d90-42d4-8f6a-3f29fa8f7f77	0	14.04139
66dd3d7f-0aca-11e7-82f6-3863bb44ef7c	1	14.04139
66dd4ccc-0aca-11e7-82f6-3863bb44ef7c	1	14.04139
8e631be4-1e0c-4627-a261-18f0003d0ba2	1	14.04139

Rides per Station

Calculate the number of rides for each station save values in num_of_trips
start_station_id is of type INT, we’ll convert it to STRING named: id_string
start_station_id has some rows with missing values, so we’ll focus our calculation on rows that have values by using IS NOT NULL
group the data by the newly created id_string

SELECT
     CAST(start_station_id AS STRING) AS id_string,
     COUNT(*)AS num_of_trips
FROM
     ny_citibikes.trips
WHERE
     CAST(start_station_id AS STRING) IS NOT NULL
GROUP BY
     id_string;

Displaying records 1 - 10
id_string	num_of_trips
483	209620
3581	5130
469	134058
531	173774
392	63209
3084	7150
3002	259381
3485	26
3531	2672
168	275134

Join Tables

I’ll use the start_station_name to match with name from the other table
Since we want these columns name, station_id, num_of_trips/rides_per_station we’ll insert them in the Outer query
NOTE: since we’ll be using start_station_name as the field to match, we have to add it to the INNER Query, so you’ll see it added right after the CAST in the second SELECT
You’ll also have to add it to the GROUP BY or you’ll get an error

SELECT
        name,
        station_id,
        num_of_trips AS rides_per_station
FROM
       (
        SELECT
             CAST(start_station_id AS STRING) AS id_string,
             start_station_name,
             COUNT(*) AS num_of_trips
        FROM
             ny_citibikes.trips
        WHERE
             CAST(start_station_id AS STRING)IS NOT NULL
        GROUP BY
             id_string,
             start_station_name
       )
       AS station_num_of_trips
INNER JOIN
       ny_citibikes.stations
       ON
       name = station_num_of_trips.start_station_name
ORDER BY
       num_of_trips DESC;

Displaying records 1 - 10
name	station_id	rides_per_station
E 17 St & Broadway	1975518396279321266	423334
W 21 St & 6 Ave	66dc120f-0aca-11e7-82f6-3863bb44ef7c	403795
West St & Chambers St	66dc0e99-0aca-11e7-82f6-3863bb44ef7c	384116
Lafayette St & E 8 St	66db65aa-0aca-11e7-82f6-3863bb44ef7c	372255
Broadway & E 14 St	66db6387-0aca-11e7-82f6-3863bb44ef7c	344546
Cleveland Pl & Spring St	66db2fd0-0aca-11e7-82f6-3863bb44ef7c	318700
W 41 St & 8 Ave	66dc3f08-0aca-11e7-82f6-3863bb44ef7c	311403
Carmine St & 6 Ave	66dbce8a-0aca-11e7-82f6-3863bb44ef7c	305874
Greenwich Ave & 8 Ave	66db631c-0aca-11e7-82f6-3863bb44ef7c	298162
University Pl & E 14 St	b30815c0-99b6-451b-be15-902992cb8abb	297737

VLookup in SQL

Reminder: VLookup is used to lookup a value for a variable from one table that appears in another table, when found we extract those rows from the other table

Write a query that returns a table containing two columns:
- the station_id and
- name (from the ny_citibikes.stations table) of
- only those stations that were used by people classified as subscribers, which is information found in the ny_citibike.trips table.

We’ll use subqueries to find

the rows in ~.trips table where the usertype = subscriber in the inner query
the outer query will take the list from the inner query and extracts all the rows that match the station_id of the rows extracted from with the inner query
The result will be zero
We need to find another column to extract the values

Note: station_id in stations table looks like this: 66dc120f-0aca-11e7-82f6-3863bb44ef7c while station_id in trips table looks like this: 561

SELECT
        station_id,
        name
FROM
        ny_citibikes.stations
WHERE
        station_id IN
                (
                SELECT
                        CAST(start_station_id AS STRING)
                FROM
                        ny_citibikes.trips
                WHERE
                        usertype = 'Subscriber'
                );

0 records
station_id	name

Avg Duration

Calculate the average trip duration per station
Calculate the difference from the average at each station
We want these columns displayed:
- station_id
- startime
- tripduration (rounded to 2 dec)
- avg_per_station
- dif_from_average
- displayed in descending order based on dif_from_average
The First and Second WHERE tells the query to link the start_station_id with the output of the query.
The third FROM clause pulls the data from the citibike_trips table and labels it as outer_tripsso we can access it from within the inner queries without having to repeat the long source of table.
What we are doing is calculating the AVG for each station_id and storing it in the outer_trips_station_id
Since we already calculated the avg per station in the above code all we need to do is create another inner subquery and subtract the tripduration from the previous subquery named avg_duration. So the simple version looks like this
ROUND(tripduration – avg_duration,2) AS duration_diff
so substitute the subquery avg_duration from section above in the statement and add it to the code above and you’ll get the complete code below with one other change, since we’ll ROUND the entire formula there is no need to ROUND the inner AVG calculation as well
change the ORDER BY to the dif_from_average

SELECT
        start_station_id,
        starttime,
        tripduration,
        (
        SELECT    ROUND(AVG(tripduration),2)
        FROM      ny_citibikes.trips
        WHERE     start_station_id = outer_query.start_station_id
        ) AS  avg_duration,
       ROUND (tripduration -
               (
                SELECT  AVG(tripduration)
                FROM    ny_citibikes.trips
                WHERE   start_station_id = outer_query.start_station_id
                ) ,2)   AS duration_diff
FROM
       ny_citibikes.trips AS outer_query
ORDER BY
       duration_diff DESC
LIMIT  50;

Displaying records 1 - 10
start_station_id	starttime	tripduration	avg_duration	duration_diff
3082	2018-01-22 18:20:27	19510049	2061.72	19507987
3349	2018-02-21 14:15:10	15962256	2908.95	15959347
3041	2018-03-15 18:21:38	15020934	11295.05	15009639
3042	2018-02-21 15:30:02	13931824	14540.53	13917283
3042	2018-02-12 15:38:54	13586276	14540.53	13571735
3383	2018-03-11 03:52:30	12479323	1780.86	12477542
3064	2018-02-08 21:46:47	11749576	2566.28	11747010
3537	2018-01-28 03:51:28	11699746	3812.10	11695934
501	2018-01-27 15:37:18	11138807	1192.56	11137614
343	2018-05-05 13:46:13	10283682	1623.92	10282058

Top 5 Avg Duration

Filter the data to include only the trips from the five stations with the longest mean trip duration.
You only want the records where the start_station_id matches one of the top five stations with the greatest averagetripduration.
This is another example similar to VLOOKUP as we did two sections above

Steps

Start with the inner most query:
- Calculate the average trip duration per station
- Group by station id
- Sort in descending order
- Limit the results to the top 5
- If you run this inner query now, you’ll get a table with one column named: start_station_id of the top 5 station
- In order to display other columns we need to take the results we just obtained and pass it to the outer query which contains the additional columns we want
Move out to the outer query:
- Select the columns we want to display in the final result
- Use the inner query of the top 5 station_ids to search VLOOKUP the dataset for rows that match those ids
- The result will contain the stations with the highest average trip duration
- Using the WHERE and IN

SELECT
        start_station_id,
        starttime,
        tripduration,
FROM
        ny_citibikes.trips
WHERE
        start_station_id IN
        (

        SELECT
                start_station_id
        FROM
                (    
                 SELECT
                        start_station_id,
                        AVG(tripduration) AS avg_duration
                FROM
                        ny_citibikes.trips
                GROUP BY
                        start_station_id
                ) AS top_five
        ORDER BY top_five.avg_duration DESC
        LIMIT 5
        );

Displaying records 1 - 10
start_station_id	starttime	tripduration
3590	2018-03-29 22:27:32	672
3649	2017-11-15 13:28:53	1488
3649	2017-11-16 18:15:53	2001
3649	2018-03-22 19:55:38	228
3649	2018-05-03 12:56:02	1639
3649	2018-04-29 12:08:58	687
3649	2018-05-28 23:03:18	121
3590	2018-03-16 16:03:56	1471
3649	2018-04-14 13:39:36	2595
3649	2018-04-05 10:03:10	1991