AWS Glue 101: All you need to know with a real-world example
In this post, I will explain the design and implementation of an ETL process using AWS services (Glue, S3, Redshift). Anyone without previous exposure to AWS Glue or the AWS stack should be able to follow along easily.
Before we dive into the walkthrough, let’s briefly answer three commonly asked questions:
What actually is AWS Glue? What are the features and advantages of using Glue? And what does a real-world scenario look like?
What is AWS Glue?
So what is Glue? AWS Glue is simply a serverless ETL tool. ETL refers to three processes commonly needed in most data analytics and machine learning workflows: Extraction, Transformation, and Loading. You extract data from a source, transform it into the right shape for applications, and then load it into a data warehouse. AWS Glue makes this happen with minimal effort: the AWS console UI offers straightforward ways to perform the whole task end to end, with no extra code scripts needed.
Components of AWS Glue
- Data catalog: The data catalog holds the metadata and the structure of the data.
- Database: It is used to create or access the database for the sources and targets.
- Table: Create one or more tables in the database that can be used by the source and target.
- Crawler and Classifier: A crawler is used to retrieve data from the source using built-in or custom classifiers. It creates/uses metadata tables that are pre-defined in the data catalog.
- Job: A job is the business logic that carries out an ETL task. Internally, it is written in Python or Scala and executed by Apache Spark.
- Trigger: A trigger starts the ETL job execution on-demand or at a specific time.
- Development endpoint: It creates a development environment where the ETL job script can be tested, developed and debugged.
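Several of these components can also be driven programmatically. As a rough sketch (the job and trigger names below are hypothetical placeholders), here is how a job and a scheduled trigger map onto the boto3 Glue API. boto3 is imported inside the functions so the snippet loads even where it is not installed:

```python
def start_job(job_name):
    """Start a Glue job on demand (the 'Trigger: on-demand' case)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed
    glue = boto3.client("glue")
    run = glue.start_job_run(JobName=job_name)
    return run["JobRunId"]


def schedule_job(trigger_name, job_name, cron="cron(0 */6 * * ? *)"):
    """Create a scheduled trigger (every 6 hours here) that runs the job."""
    import boto3
    glue = boto3.client("glue")
    glue.create_trigger(
        Name=trigger_name,
        Type="SCHEDULED",
        Schedule=cron,
        Actions=[{"JobName": job_name}],
        StartOnCreation=True,
    )
```

The schedule uses the six-field AWS cron format; `cron(0 */6 * * ? *)` fires at the top of every sixth hour.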
Why use AWS Glue?
How does Glue benefit us? Here are some of the advantages of using it in your own workspace or organization:
- AWS Glue scans through all the available data with a crawler
- Final processed data can be stored in many different places (Amazon RDS, Amazon Redshift, Amazon S3, etc.)
- It’s a cloud service, so no money is needed for on-premises infrastructure
- It’s a cost-effective option, as it’s a serverless ETL service
- It’s fast. It gives you the Python/Scala ETL code right off the bat.
A Production Use-Case of AWS Glue
Here is a practical example of using AWS Glue.
A piece of game software produces a few MB or GB of user-play data daily. The server that collects the user-generated data from the software pushes the data to AWS S3 once every 6 hours. (A JDBC connection connects data sources and targets such as Amazon S3, Amazon RDS, Amazon Redshift, or any external database.)
We, the company, want to predict the length of play given the user profile. To perform this task, the data engineering team needs to collect all the raw data and pre-process it in the right way. Glue offers a Python SDK with which we can create a Glue Job script that streamlines the ETL. The code runs on top of Spark (a distributed system that speeds up processing), which is configured automatically in AWS Glue. Thanks to Spark, the data is divided into small chunks and processed in parallel on multiple machines simultaneously.
Extract — The script will read all the usage data from the S3 bucket into a single data frame (similar to a DataFrame in pandas)
Transform — Let’s say the original data contains 10 different logs per second on average. The analytics team wants the data aggregated per minute with specific logic.
Load — Write the processed data back to another S3 bucket for the analytics team.
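The Transform step is the only non-trivial logic here. As a minimal, runnable stand-in for the Spark aggregation (the column names are hypothetical), the per-minute roll-up looks like this in pandas:

```python
import pandas as pd

# Toy usage log: several events per second, as described above.
# "play_seconds" is a made-up metric for illustration.
logs = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-01 00:00:01", "2021-01-01 00:00:02",
        "2021-01-01 00:00:30", "2021-01-01 00:01:05",
        "2021-01-01 00:01:40",
    ]),
    "play_seconds": [10, 5, 20, 30, 15],
})

# Transform: collapse the raw log into one row per minute.
per_minute = (
    logs.set_index("timestamp")
        .resample("1min")["play_seconds"]
        .agg(["count", "sum"])
        .rename(columns={"count": "events", "sum": "total_play_seconds"})
)
print(per_minute)
```

In the actual Glue job, the same grouping would be expressed on a Spark DataFrame, so the work is distributed across executors instead of running on one machine.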
For the scope of the project, we will use a sample CSV file from the Telecom Churn dataset. (The data contains 20 different columns. The objective is binary classification: predict whether each customer will stop subscribing to the telecom service, based on information about that customer. A description of the data and the dataset itself can be downloaded from this Kaggle Link here.)
1. Create an IAM role to access AWS Glue + EC2 + CloudWatch + S3
You need an appropriate role to access the different services you are going to use in this process. An IAM role is similar to an IAM user, in that it is an AWS identity with permission policies that determine what the identity can and cannot do in AWS. When you assume a role, it provides you with temporary security credentials for your role session. You can find more about IAM roles here.
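The console steps below can also be expressed in code. A minimal sketch, assuming the role name is your choice: the trust policy lets the Glue service assume the role, and the managed AWSGlueServiceRole policy grants the Glue permissions (boto3 is imported inside the function so the snippet loads without it installed):

```python
import json

# Trust policy allowing the Glue service to assume this role.
GLUE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}


def create_glue_role(role_name="AWSGlueServiceRole-demo"):  # hypothetical name
    """Create the role and attach the AWS-managed Glue service policy."""
    import boto3  # imported lazily so the sketch loads without boto3 installed
    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(GLUE_TRUST_POLICY),
    )
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    )
    return role["Role"]["Arn"]
```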
- Open Amazon IAM console
- Click on Roles → Create Role.
- Choose Glue service from “Choose the service that will use this role”
- Choose Glue from the “Select your use case” section
- Select “AWSGlueServiceRole” from the Attach Permissions Policies section.
- Click on Next: Tags. Leave the Add tags section blank. Create role.
- Your role now has access to AWS Glue and the other services it needs
2. Upload source CSV files to Amazon S3
- On the Amazon S3 console, click Create bucket to create a bucket where you can store files and folders.
- Enter a bucket name, select a Region and click on Next
- The remaining configuration settings can remain empty now. Click Next to create your S3 bucket.
- Create a new folder in your bucket and upload the source CSV files
- (Optional) Before loading data into the bucket, you can compress the data into a different format (e.g., Parquet) using several Python libraries
3. Start the AWS Glue Database
In order to add data to the Glue Data Catalog, which holds the metadata and the structure of the data, we need to define a Glue database as a logical container.
So we first initialize the Glue database.
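Instead of the console, the database can also be initialized with a single boto3 call. `sampledb` matches the database name used later in this walkthrough; the description is made up:

```python
DATABASE_INPUT = {
    "Name": "sampledb",
    "Description": "Logical container for the churn tutorial tables",  # hypothetical
}


def create_glue_database(database_input=DATABASE_INPUT):
    """Create the Glue Data Catalog database the crawler will populate."""
    import boto3  # imported lazily so the sketch loads without boto3 installed
    glue = boto3.client("glue")
    glue.create_database(DatabaseInput=database_input)
```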
4. Create and Run Glue Crawlers
Now that our Glue database is ready, we need to feed it our data. We will create a crawler that scans all the available data in the specified S3 bucket. The crawler automatically identifies the most common formats, including CSV, JSON, and Parquet.
- On the left pane in the AWS Glue console, click on Crawlers, then click the blue Add crawler button.
- Give the crawler a name, and leave “Specify crawler type” as it is
- In Data Store, choose S3 and select the bucket you created. Drill down to select the read folder
- In IAM role, choose the role you created above
- Leave the Frequency on “Run on Demand” for now. You can always schedule the crawler later to suit your needs.
- In Output, specify a Glue database you created above (sampledb)
- Then, a Glue Crawler that reads all the files in the specified S3 bucket is generated
- Click the checkbox and Run the crawler by clicking Run Crawler
- Once it’s done, you should see its status as ‘Stopping’, and ‘Last Runtime’ and ‘Tables Added’ will be populated.
- Then, Databases → Tables on the left pane let you verify if the tables were created automatically by the crawler.
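The same crawler setup can be scripted; the crawler name, role ARN, and S3 path below are placeholders for the values you chose in the console:

```python
def create_and_run_crawler(name, role_arn, s3_path, database="sampledb"):
    """Register a crawler over one S3 path and start it immediately."""
    import boto3  # imported lazily so the sketch loads without boto3 installed
    glue = boto3.client("glue")
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,  # the Glue database created earlier
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=name)
```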
5. Define Glue Jobs
With the final tables in place, we now create Glue Jobs, which can be run on a schedule, on a trigger, or on demand. The interesting thing about creating Glue jobs is that they can be built almost entirely through the GUI, with just a few button clicks needed to auto-generate the necessary Python code. However, I will make a few edits in order to synthesize multiple source files and perform in-place data quality validation. By default, Glue uses DynamicFrame objects to contain relational data tables, and they can easily be converted back and forth to PySpark DataFrames for custom transforms.
Note that at this step, you have the option to spin up another database (e.g., AWS Redshift) to hold the final data tables if the data from the crawler grows large. For the scope of the project, we skip this and will put the processed data tables directly back into another S3 bucket.
- In the left pane, click on Jobs, then click on Add Job
- Give it a name and then select the IAM role previously created for AWS Glue
- Select Spark for the Type and select Spark 2.4, Python 3 for Glue Version
- You can edit the number of DPU (Data processing unit) value in the Maximum capacity field of Security configuration, script libraries, and job parameters (optional).
- The remaining configuration is optional
- Choose a data source table from Choose a data source section. You can choose only a single data source.
- Add a JDBC connection to AWS Redshift if that is where you want to store the final processed data. You can choose your existing database if you have one, or write back to an S3 bucket. For this tutorial, we go ahead with the default mapping. The business logic can also modify this later.
- Open the Python script by selecting the recently created job name. Click on Action -> Edit Script.
- The left pane shows a visual representation of the ETL process. The right-hand pane shows the script code and just below that you can see the logs of the running Job.
- Save and execute the Job by clicking on Run Job.
- After running the job, we can see its run history, and the final data is populated in S3 (or ready for SQL queries if we had chosen Redshift as the final data store)
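To check that run history from code rather than the console, a small sketch (the job name is whatever you chose above):

```python
def latest_run_state(job_name):
    """Return the state of the most recent run of a Glue job, or None."""
    import boto3  # imported lazily so the sketch loads without boto3 installed
    glue = boto3.client("glue")
    runs = glue.get_job_runs(JobName=job_name, MaxResults=1)["JobRuns"]
    # States include RUNNING, SUCCEEDED, FAILED, STOPPED, and TIMEOUT.
    return runs[0]["JobRunState"] if runs else None
```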
To summarize, I’ve built one full ETL process: we created an S3 bucket, uploaded our raw data to the bucket, started the Glue database, added a crawler that browses the data in the above S3 bucket, created a Glue Job that can be run on a schedule, on a trigger, or on demand, and finally loaded the processed data back to the S3 bucket.
Additional work could include revising the Python script generated at the Glue Job stage to fit your own business needs (e.g., improving the pre-processing to scale the numeric variables).
So this is pretty much how you can productionize a data pipeline using AWS Glue.
About the Author
HyunJoon is a Data Analyst at AtGames, a Game Software/Hardware company. He has degrees in Statistics from UCLA. He is a data enthusiast who enjoys sharing data science/analytics knowledge.
Follow him on LinkedIn.