AWS Glue 101: All you need to know with a real-world example

Full ETL Pipeline Explained.

Simple AWS-based ETL Pipeline

What is AWS Glue?

So what is Glue? AWS Glue is simply a serverless ETL tool. ETL refers to three processes that are commonly needed in most data analytics / machine learning workflows: Extraction, Transformation, and Loading. We extract data from a source, transform it into the right shape for our applications, and then load it into a data warehouse. AWS Glue makes this magic happen, and the AWS console UI offers straightforward ways to perform the whole task end to end. No extra code scripts are needed.

Components of AWS Glue

  • Data catalog: The data catalog holds the metadata and the structure of the data.
  • Database: It is used to create or access the database for the sources and targets.
  • Table: Create one or more tables in the database that can be used by the source and target.
  • Crawler and Classifier: A crawler scans the data in the source using built-in or custom classifiers and creates or updates metadata tables in the data catalog.
  • Job: A job is the business logic that carries out an ETL task. Internally, this business logic is written in Apache Spark with Python or Scala.
  • Trigger: A trigger starts the ETL job execution on-demand or at a specific time.
  • Development endpoint: It creates a development environment where the ETL job script can be tested, developed and debugged.
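
These components are also exposed programmatically. As a quick illustration (not part of the console walkthrough below), here is a minimal boto3 sketch that lists the databases and tables in the Data Catalog, assuming your AWS credentials and region are already configured:

import boto3

glue = boto3.client("glue")

# Databases in the Glue Data Catalog
for db in glue.get_databases()["DatabaseList"]:
    print("database:", db["Name"])
    # Tables registered under each database (e.g. by a crawler)
    for table in glue.get_tables(DatabaseName=db["Name"])["TableList"]:
        print("  table:", table["Name"])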

Why use AWS Glue?

How does Glue benefit us? Here are some of the advantages of using it in your own workspace or organization:

  • Final processed data can be stored in many different places (Amazon RDS, Amazon Redshift, Amazon S3, etc.).
  • It’s a cloud service, so no money is needed for on-premises infrastructure.
  • It’s a cost-effective option, as it’s a serverless ETL service.
  • It’s fast. It gives you the Python/Scala ETL code right off the bat.

A Production Use-Case of AWS Glue

Here is a practical example of using AWS Glue.

Project walkthrough

For the scope of the project, we will use the sample CSV file from the Telecom Churn dataset (the data contains 20 different columns; the objective is binary classification, and the goal is to predict whether a customer will stop subscribing to the telecom service based on information about that customer. A description of the data, and the dataset itself, can be downloaded from this Kaggle link here).
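
Before uploading anything, it is worth a quick local sanity check of the source file. A minimal sketch with pandas, assuming the downloaded file is named telecom_churn.csv and the binary target column is named Churn (adjust both to match the actual Kaggle file):

import pandas as pd

# Load the raw Kaggle CSV (the filename here is an assumption)
df = pd.read_csv("telecom_churn.csv")

print(df.shape)    # expect 20 columns
print(df.dtypes)   # column names and inferred types

# Peek at the binary target ("Churn" is an assumed column name)
print(df["Churn"].value_counts())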

1. Create an IAM role to access AWS Glue + EC2 + CloudWatch + S3

  • Click on Roles → Create Role.
  • Choose Glue service from “Choose the service that will use this role”
  • Choose Glue from the “Select your use case” section
  • Select “AWSGlueServiceRole” from the Attach Permissions Policies section.
  • Click on Next: Tags. Leave the Add tags section blank. Create role.
  • Your role now has full access to AWS Glue and the other services.
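
If you prefer code over the console, the same role can be created with boto3. A sketch, assuming the role name MyGlueServiceRole (any name works):

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Glue service assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="MyGlueServiceRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed policy chosen in the console steps above
iam.attach_role_policy(
    RoleName="MyGlueServiceRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)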

2. Upload source CSV files to Amazon S3

  • Create a new folder in your bucket and upload the source CSV files
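
The console upload above can also be done with boto3. A sketch, where the bucket name and folder/key are placeholders for your own:

import boto3

s3 = boto3.client("s3")

# Upload the local CSV into a folder ("raw/") inside the bucket
s3.upload_file(
    "telecom_churn.csv",        # local file
    "my-glue-demo-bucket",      # your bucket name
    "raw/telecom_churn.csv",    # key (folder/filename) in the bucket
)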

3. Create the AWS Glue Database

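In the Glue console this is a single "Add database" action; the walkthrough names the database sampledb. The equivalent boto3 call, as a sketch:

import boto3

glue = boto3.client("glue")

# Create the Glue database used by the crawler and jobs below
glue.create_database(DatabaseInput={"Name": "sampledb"})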

4. Create and Run Glue Crawlers

Now that our Glue database is ready, we need to feed our data into it. So what we are trying to do is this: we will create a crawler that scans all available data in the specified S3 bucket. The crawler automatically identifies the most common formats, including CSV, JSON, and Parquet, using built-in classifiers.

  • Click the blue Add crawler button.
  • Give the crawler a name, and leave “Specify crawler type” as it is.
  • Leave the Frequency on “Run on Demand” for now. You can always schedule the crawler later to suit your needs.
  • In Output, specify a Glue database you created above (sampledb)
  • Click the checkbox and Run the crawler by clicking Run Crawler
  • Once it’s done, you should see its status as ‘Stopping’, and the ‘Last Runtime’ and ‘Tables Added’ columns will be filled in.
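The same crawler can be defined and started with boto3. A sketch, where the crawler name and S3 path are placeholders and the role is the one created in step 1:

import boto3

glue = boto3.client("glue")

# Define a crawler that scans the raw CSVs and writes tables into sampledb
glue.create_crawler(
    Name="telecom-churn-crawler",
    Role="MyGlueServiceRole",
    DatabaseName="sampledb",
    Targets={"S3Targets": [{"Path": "s3://my-glue-demo-bucket/raw/"}]},
)

# Kick off an on-demand run
glue.start_crawler(Name="telecom-churn-crawler")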

5. Define Glue Jobs

With the final tables in place, we now create Glue Jobs, which can be run on a schedule, on a trigger, or on demand. The interesting thing about creating Glue jobs is that it can be an almost entirely GUI-based activity, with just a few button clicks needed to auto-generate the necessary Python code. However, I will make a few edits in order to synthesize multiple source files and perform in-place data-quality validation. By default, Glue uses DynamicFrame objects to contain relational data tables, and they can easily be converted back and forth to PySpark DataFrames for custom transforms.
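
For orientation, here is a minimal sketch of what such a job script looks like, assuming the catalog table telecom_churn created by the crawler in sampledb and a placeholder output path (the auto-generated script follows the same shape):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Extract: read the crawled table from the Data Catalog as a DynamicFrame
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="sampledb", table_name="telecom_churn")

# Transform: drop to a PySpark DataFrame for a simple data-quality check
df = dyf.toDF().dropna()  # e.g. remove rows with missing values

# Load: convert back to a DynamicFrame and write the result to S3 as Parquet
out = DynamicFrame.fromDF(df, glueContext, "out")
glueContext.write_dynamic_frame.from_options(
    frame=out,
    connection_type="s3",
    connection_options={"path": "s3://my-glue-demo-bucket/processed/"},
    format="parquet",
)

job.commit()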

  • Select Spark for the Type and select Spark 2.4, Python 3 for Glue Version
  • You can edit the number of DPUs (Data Processing Units) in the Maximum capacity field under Security configuration, script libraries, and job parameters (optional).
  • The remaining configuration is optional
  • The left pane shows a visual representation of the ETL process. The right-hand pane shows the script code and just below that you can see the logs of the running Job.
  • Save and execute the Job by clicking on Run Job.
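The finished job can also be run, or put on a schedule, with boto3. A sketch, assuming the job was named telecom-churn-etl in the console:

import boto3

glue = boto3.client("glue")

# On-demand run of the job
run = glue.start_job_run(JobName="telecom-churn-etl")
print(run["JobRunId"])

# Or a scheduled trigger (cron syntax, UTC): run every day at 06:00
glue.create_trigger(
    Name="daily-churn-etl",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "telecom-churn-etl"}],
    StartOnCreation=True,
)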

6. Conclusion

To summarize, I’ve built one full ETL process: we created an S3 bucket, uploaded our raw data to the bucket, created the Glue database, added a crawler that scans the data in that S3 bucket, created a Glue Job that can be run on a schedule, on a trigger, or on demand, and finally wrote the processed data back to the S3 bucket.

About the Author

HyunJoon is a Data Analyst at AtGames, a Game Software/Hardware company. He has degrees in Statistics from UCLA. He is a data enthusiast who enjoys sharing data science/analytics knowledge.
Follow him on LinkedIn.
