K-Means Clustering in Google BigQuery ML
A complete guide to the most popular and practical clustering technique, natively in Google BigQuery (data + ML)

In this post, the implementation of K-means clustering in Google Cloud Platform (BigQuery) is shown in detail. Anyone without previous experience of the Google Cloud stack should be able to follow along easily.

This article is for you, if you are:
- Curious about one of the most common business problems and the tools you can apply to solve it
- Stuck coding K-means clustering in Python or R, and want to explore an alternative
- New to (or have little experience with) Google Cloud Platform
- Looking for a full, step-by-step K-means clustering walkthrough (define the problem, create a model, interpret the results)
- Eager to learn cool (and actually useful) data-science stuff
This article is NOT for you, if you are:
- Looking for a basic, dictionary-style definition of k-means clustering
- Already an expert in ML
- Here to learn Python/R
- Here to write code
But for those who want a quick refresher on what k-means clustering is,
K-means Overview
- The process begins with k centroids initialized at random.
- These centroids are used to assign points to their nearest cluster.
- The mean of all points within the cluster is then used to update the position of the centroids.
- The above steps are repeated until the values of the centroids stabilize.
[Reference](https://blog.cambridgespark.com/how-to-determine-the-optimal-number-of-clusters-for-k-means-clustering-14f27070048f)
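In symbols (a standard textbook formulation, not specific to any tool), with data points $x_i$ and centroids $c_1, \dots, c_k$, each iteration alternates an assignment step and an update step:

$$S_j = \{\, x_i : \|x_i - c_j\|^2 \le \|x_i - c_l\|^2 \ \text{for all } l \,\}, \qquad c_j \leftarrow \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i$$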
OK, let’s dive in.
Motivation
If you are currently working as a data analyst, data scientist, or business analyst, or if you are an aspiring data professional, you have probably encountered (or at least heard of) a classification or clustering problem. It is one of the most common problems in real-world business cases, and (in my experience) one of the most powerful influences on business decisions.
And still, in many business organizations, clustering is often performed by the best guess of a senior member or executive. For example, a monthly marketing budget is decided simply by the age group of the audience, regardless of spending habits or visit frequency. As a data professional, that’s a no-no.
We want to use data science to guide the team to make a more sophisticated guess.

What is Google BigQuery?
Google BigQuery can be a big help here.
I guarantee that learning this new approach will also help you apply the same concepts in high-level programming languages (so please read through).
Most digital organizations have their own database that stores some kind of data, and many have adopted Google BigQuery as their main SQL data warehouse. So it is very useful to learn the capabilities and scalability of Google BigQuery, which can be extended in many different ways.
BigQuery is a fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a Platform as a Service that supports querying using ANSI SQL. It also has built-in machine learning capabilities.
This article summarizes the official document on the [Google Cloud website](https://cloud.google.com/bigquery-ml/docs/kmeans-tutorial).
If you have so far considered Google BigQuery merely a database, you might be surprised.
Table of contents
The only prerequisite for this article is perhaps basic SQL, but even that is not strictly required.
- Create a new GCP project
- Load the demo data
- Create a dataset (which stores the model)
- Examine the data
- Write a SQL query
- Create a K-means cluster model
- Predict the cluster using the ‘ml.predict’ function
- Business insights / conclusion
- (optional) Delete the dataset
How it works
1. Create a new GCP project

If you do not already have a Google Cloud Platform account, you can easily create one. Follow the steps after you click ‘get started for free’ on the Google Cloud website. It’s pretty straightforward.

Then, log in to the [GCP Console](https://console.cloud.google.com/home), and you should see something like this:

I’m going to create a new project here for this walkthrough. Click ‘Choose project’ in the upper-left corner, then click ‘New Project’ in the dialog.

Name your project and click ‘Create’. Once the project is created, you can check its basic information in a dashboard format.
We are going to use BigQuery, so find and click ‘Big Data’ → ‘BigQuery’ in the left-side pane.

If this is your first time using BigQuery in Google Cloud Platform, you will see this screen. Click ‘Enable’.
2. Load the demo data
BigQuery has a number of demo datasets that are free to use for everyone. In this specific example, we will use the ‘London Bicycle Hires’ dataset to build a K-means clustering model.

First, find ‘+ADD DATA’ in the left pane and click ‘Explore public datasets’.

Search for ‘London Bicycle Hires’ and click ‘View dataset’.
You should then see the ‘bigquery-public-data’ project added in the bottom-left corner. If you can find london_bicycles when scrolling down, we are good to go!
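To double-check that the data is accessible, you can run a quick sanity query in the Query editor (a trivial check; the exact count may differ as the public dataset gets updated):

```sql
-- confirm we can read the public rentals table
select count(*) as num_rentals
from `bigquery-public-data.london_bicycles.cycle_hire`;
```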
3. Create a dataset (which stores the model)
In order to create a K-means clustering model in BigQuery, we need to create a ‘Dataset’ that will store the model we build.

To do this, click the project ID you created in step 1 (in my case my-project 0623), and click ‘CREATE DATASET’ on the right side of the screen.

Put k_means_tutorial in Dataset ID, and make sure to choose ‘EU’ for Data location (the London bicycle data is stored in the EU multi-region, so this dataset must be located in the same region). Leave the other settings as they are, and click ‘Create dataset’.
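As a side note, if you prefer SQL over clicking through the UI, the same dataset can be created with a DDL statement (a minimal sketch of the equivalent step):

```sql
-- create the dataset that will store the model, co-located with the public data
create schema if not exists k_means_tutorial
options (location = 'EU');
```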
4. Examine the dataset
As mentioned above, we are going to use the london_bicycles dataset, and we are going to cluster the bicycle stations by the following three characteristics:
- Rental length
- Number of daily rentals
- Distance from the city
The london_bicycles dataset has two tables (cycle_hire and cycle_stations), and you can click each table to see its columns.
cycle_hire: a rentals table with rental_id and bike_id as keys, containing the duration and start/end station information for each bike rental

cycle_stations: data for the bike rental stations (longitude/latitude and the number of bikes at each station)
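If you would rather peek at the data than click through the schema tabs, a couple of simple preview queries will do:

```sql
-- preview a few rentals
select *
from `bigquery-public-data.london_bicycles.cycle_hire`
limit 5;

-- preview a few stations
select *
from `bigquery-public-data.london_bicycles.cycle_stations`
limit 5;
```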

5. Write a SQL Query
The variables we are going to use to cluster the stations are:
- Rental length
- Number of daily rentals
- Distance from the city
Select the k_means_tutorial dataset we created, copy and paste the SQL query below into the ‘Query editor’, and click ‘Run’:
```sql
with hs as (
  select
    h.start_station_name as station_name
    , if(extract(dayofweek from h.start_date) = 7 or
         extract(dayofweek from h.start_date) = 1,
         "weekend", "weekday") as isweekday
    , h.duration
    , st_distance(st_geogpoint(s.longitude, s.latitude),
                  st_geogpoint(-0.1, 51.5)) / 1000 as distance_from_city_center
  from `bigquery-public-data.london_bicycles.cycle_hire` as h
  join `bigquery-public-data.london_bicycles.cycle_stations` as s
    on h.start_station_id = s.id
  where h.start_date between cast('2015-01-01 00:00:00' as timestamp)
    and cast('2016-01-01 00:00:00' as timestamp)
),
stationstats as (
  select
    station_name
    , avg(duration) as duration
    , count(duration) as num_trips
    , max(distance_from_city_center) as distance_from_city_center
  from hs
  group by station_name
)
select *
from stationstats
order by distance_from_city_center
```
You should see something like the following in the ‘Query results’ section:

I’m not going to go over the details of SQL syntax, as this is not a SQL tutorial post. Briefly:
- We created two temporary tables, ‘hs’ and ‘stationstats’, as subqueries (CTEs in the WITH clause).
- ‘hs’ returns one row per rental in 2015 (station name, weekday/weekend flag, rental duration, and distance from the city center; see the note on the distance calculation right after this list).
- Then, in the ‘stationstats’ table, we use aggregate functions to calculate summary statistics per station from the ‘hs’ table.
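The only non-obvious piece is the distance feature: st_geogpoint(-0.1, 51.5) is roughly the longitude/latitude of central London, and st_distance returns meters, hence the division by 1000. A standalone illustration (the first point here is just an arbitrary example coordinate):

```sql
-- distance in km from an example point to the city-center reference point
select
  st_distance(st_geogpoint(-0.15, 51.52), st_geogpoint(-0.1, 51.5)) / 1000
    as distance_km;
```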
6. Create a K-means cluster model
Now that we understand the basic mechanics of BigQuery, we can create a K-means clustering model.
We use the create model statement, and with model_type = ‘kmeans’ we train a clustering model.
Copy and paste the query below, and click ‘Run’:
```sql
create or replace model k_means_tutorial.london_station_clusters
options (model_type='kmeans', num_clusters=4) as
with hs as (
  select
    h.start_station_name as station_name
    , if(extract(dayofweek from h.start_date) = 7 or
         extract(dayofweek from h.start_date) = 1,
         "weekend", "weekday") as isweekday
    , h.duration
    , st_distance(st_geogpoint(s.longitude, s.latitude),
                  st_geogpoint(-0.1, 51.5)) / 1000 as distance_from_city_center
  from `bigquery-public-data.london_bicycles.cycle_hire` as h
  join `bigquery-public-data.london_bicycles.cycle_stations` as s
    on h.start_station_id = s.id
  where h.start_date between cast('2015-01-01 00:00:00' as timestamp)
    and cast('2016-01-01 00:00:00' as timestamp)
),
stationstats as (
  select
    station_name
    , isweekday
    , avg(duration) as duration
    , count(duration) as num_trips
    , max(distance_from_city_center) as distance_from_city_center
  from hs
  group by station_name, isweekday
)
select * except(station_name, isweekday)
from stationstats
order by distance_from_city_center
```
This SQL query differs from the previous one in just two parts.
- We added a line to create the model in the k_means_tutorial dataset we made above; the resulting model is stored as london_station_clusters:

```sql
create or replace model k_means_tutorial.london_station_clusters
options (model_type='kmeans', num_clusters=4) as
```
- We added the new column ‘isweekday’ to the group by clause in the stationstats table, and also added except(station_name, isweekday) to the final select. We do this because we want to see how the weekday/weekend split affects the rental pattern, while select * except(columns) excludes the listed columns from the output (a toy illustration follows right below).
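As a quick toy illustration of except(), independent of this tutorial:

```sql
-- returns only column c, since a and b are excluded
select * except(a, b)
from (select 1 as a, 2 as b, 3 as c);
```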

Once done, you should see something like this. Click ‘Go to model’ to check the details.

If you click ‘Schema’, you can see that the model is trained using three columns (duration, num_trips, distance_from_city_center).

If you click ‘Evaluation’, you can see each cluster. We have 4 clusters since we set num_clusters to 4 in create model, and we can see each cluster’s centroid values.
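If you prefer SQL to the UI, the same information is available through BigQuery ML’s built-in inspection functions (a minimal sketch; the model name assumes you followed the naming in this tutorial):

```sql
-- numeric centroid values per cluster and feature
select *
from ml.centroids(model k_means_tutorial.london_station_clusters);

-- clustering metrics such as the Davies-Bouldin index and mean squared distance
select *
from ml.evaluate(model k_means_tutorial.london_station_clusters);
```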
7. Predict the cluster using the ‘ml.predict’ function
We are going to use the built-in ‘ml.predict’ function to find the cluster that a certain station belongs to.
```sql
with hs as (
  select
    h.start_station_name as station_name
    , if(extract(dayofweek from h.start_date) = 7 or
         extract(dayofweek from h.start_date) = 1,
         "weekend", "weekday") as isweekday
    , h.duration
    , st_distance(st_geogpoint(s.longitude, s.latitude),
                  st_geogpoint(-0.1, 51.5)) / 1000 as distance_from_city_center
  from `bigquery-public-data.london_bicycles.cycle_hire` as h
  join `bigquery-public-data.london_bicycles.cycle_stations` as s
    on h.start_station_id = s.id
  where h.start_date between cast('2015-01-01 00:00:00' as timestamp)
    and cast('2016-01-01 00:00:00' as timestamp)
),
stationstats as (
  select
    station_name
    , isweekday
    , avg(duration) as duration
    , count(duration) as num_trips
    , max(distance_from_city_center) as distance_from_city_center
  from hs
  group by station_name, isweekday
)
select * except(nearest_centroids_distance)
from ml.predict(
  model k_means_tutorial.london_station_clusters,
  (
    select *
    from stationstats
  )
)
```
The above query has two parts:
- The K-means clustering model we created
- The test-set data (for prediction)
We added except(nearest_centroids_distance) to look only at the predicted clusters.

We can see which cluster each row belongs to in the centroid_id column.
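To look up a single station, you can simply filter the input fed to ml.predict. A minimal sketch, reusing the hs and stationstats CTEs from the query above (the pattern ‘Kennington’ is just an illustrative choice):

```sql
select * except(nearest_centroids_distance)
from ml.predict(
  model k_means_tutorial.london_station_clusters,
  (
    select *
    from stationstats
    where regexp_contains(station_name, 'Kennington')  -- filter to one area
  )
)
```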
8. Business insights / Conclusion
Imagine you are on a data science team at a bike-rental operator in London, and you have to come up with a project plan based on this clustering model.
Let’s look at the Evaluation chart again.

From the result we got, we can draw the following conclusions:
- Cluster #1 is far from the city center and has a longer average rental duration
- Cluster #2 is closest to the city center and less crowded
- Cluster #3 has the highest number of rentals (trips) and is close to the city center → these would be the most popular stations
- Cluster #4 is far from the city center but has a shorter rental length than cluster #1
Therefore, we can probably take the following business actions:
- For a Q1 expansion project placing racing bikes at certain stations, cluster #1 would be the best group to test, because its stations are farthest from the city and have the longest rental durations
- For a deep dive into rental patterns in London, we could prioritize cluster #3, because it is close to the city and has the highest number of recorded rentals
The above suggestions, drawn from just the basic information the GCP UI gives, are already powerful enough to start with; more advanced analysis is omitted here for demonstration purposes.
9. Delete the dataset
To prevent any unexpected further charges for the work done in this tutorial, let’s delete the dataset from the project.

Go to your project ID, and click ‘DELETE DATASET’ on the right.
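Alternatively, the cleanup can be done in SQL; note that this permanently deletes the dataset and the model inside it:

```sql
-- remove the tutorial dataset and everything in it, including the model
drop schema if exists k_means_tutorial cascade;
```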
Thank you so much for reading this post! If you found the content fruitful, I would appreciate any support (sharing/liking) or a clap (👏🏼) using the button below.
About the Author
HyunJoon is a Data Analyst. He has a degree in Statistics from UCLA. He is a data enthusiast who enjoys sharing data science/analytics knowledge.
Connect with him on LinkedIn.