K-Means Clustering in Google BigQuery ML

A complete guide on the most popular and practical clustering technique natively in Google BigQuery (data+ML)

Image for post
Image for post
Get this image on Wikimedia Commons | License details
Image for post
Image for post
License details Creator: Decipher Zone Softwares Copyright: Created by Decipher Zone Softwares (Free to use content)

This article is for you, if you are:

This article is NOT for you, if you are:

Image for post
Image for post
Photo by Hasin Hayder on Unsplash

Motivation

Image for post
Image for post
License details Creator: structuresxx | Credit: Getty Images/iStockphoto

What is Google BigQuery?

Table of contents

How it works

1. Create a new GCP project

Image for post
Image for post
Image for post
Image for post
Image for post
Image for post
Image for post
Image for post
Image for post
Image for post

2. Load the demo data

Image for post
Image for post
Image for post
Image for post

3. Create a dataset (which stores the model)

Image for post
Image for post
Image for post
Image for post

4. Examine the dataset

Image for post
Image for post
Image for post
Image for post

5. Write a SQL Query

with hs as (
select
h.start_station_name as station_name
,if (extract(dayofweek from h.start_date) = 7 or
extract(dayofweek from h.start_date) = 1,
"weekend","weekday") as isweekday
,h.duration
,st_distance(st_geogpoint(s.longitude, s.latitude), st_geogpoint(-0.1, 51.5)) / 1000 as distance_from_city_center
from `bigquery-public-data.london_bicycles.cycle_hire`as h
join `bigquery-public-data.london_bicycles.cycle_stations`as s
on h.start_station_id = s.id
where h.start_date between cast('2015-01-01 00:00:00' as timestamp) and
cast('2016-01-01 00:00:00' as timestamp)
),
stationstats as (
select
station_name
,avg(duration) as duration
,count(duration) as num_trips
,max(distance_from_city_center) as distance_from_city_center
from hs
group by station_name
)

select *
from stationstats
order by distance_from_city_center
Image for post
Image for post

6. Create a K-means cluster model

create or replace model
k_means_tutorial.london_station_clusters OPTIONS(model_type='kmeans',num_clusters=4) as
with hs as (
select
h.start_station_name as station_name
,if (extract(dayofweek from h.start_date) = 7 or
extract(dayofweek from h.start_date) = 1,
"weekend","weekday") as isweekday
,h.duration
,st_distance(st_geogpoint(s.longitude, s.latitude), st_geogpoint(-0.1, 51.5)) / 1000 as distance_from_city_center
from `bigquery-public-data.london_bicycles.cycle_hire`as h
join `bigquery-public-data.london_bicycles.cycle_stations`as s
on h.start_station_id = s.id
where h.start_date between cast('2015-01-01 00:00:00' as timestamp) and
cast('2016-01-01 00:00:00' as timestamp)
),
stationstats as (
select
station_name
, isweekday
,avg(duration) as duration
,count(duration) as num_trips
,max(distance_from_city_center) as distance_from_city_center
from hs
group by station_name, isweekday
)
select * except(station_name, isweekday)
from stationstats
order by distance_from_city_center
create or replace model
bqml_tutorial.london_station_clusters options (model_type='kmeans',num_clusters=4) as
Image for post
Image for post
Image for post
Image for post
Image for post
Image for post

7. Predict the cluster using ‘ml.predict’ function

with hs as (
select
h.start_station_name as station_name
,if (extract(dayofweek from h.start_date) = 7 or
extract(dayofweek from h.start_date) = 1,
"weekend","weekday") as isweekday
,h.duration
,st_distance(st_geogpoint(s.longitude, s.latitude), st_geogpoint(-0.1, 51.5)) / 1000 as distance_from_city_center
from `bigquery-public-data.london_bicycles.cycle_hire`as h
join `bigquery-public-data.london_bicycles.cycle_stations`as s
on h.start_station_id = s.id
where h.start_date between cast('2015-01-01 00:00:00' as timestamp) and
cast('2016-01-01 00:00:00' as timestamp)
),
stationstats as (
select
station_name
, isweekday
,avg(duration) as duration
,count(duration) as num_trips
,max(distance_from_city_center) as distance_from_city_center
from hs
group by station_name, isweekday
)
select * except(nearest_centroids_distance)
from ml.predict(
model k_means_tutorial.london_station_clusters,
(
select *
from stationstats
)
)
Image for post
Image for post

8. Business insights / Conclusion

Image for post
Image for post

9. Delete the dataset

Image for post
Image for post

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store