AWS Glue
General
AWS Glue is a serverless data integration service used for data discovery, preparation, cleansing, transformation, and integration from multiple sources.
As it is based on Spark, an anti-pattern would be using it for projects that require multiple ETL engines (such as Hive, Hadoop, etc.); in that case, Amazon EMR would be a better fit.
Using Glue for processing streaming data used to be an anti-pattern, but that is no longer the case: Glue can consume from Kinesis or Apache Kafka thanks to Apache Spark Structured Streaming.
Glue Components
Data Catalog
The Data Catalog is a centralized metadata catalog that allows data to be queried from Amazon Athena, Amazon EMR (with SQL-like queries via Hive), Amazon Redshift Spectrum, and Amazon QuickSight. It is an important piece of a Data Lake, as it allows unstructured data to be queried.
The Data Catalog contains information such as:
- Database Name
- Table Name
- Column Names
- Data Types
- Data Partitioning - Based on how data is stored in S3; it can have a big performance impact if not defined well. For example, suppose you receive files from different partners every day. If you will most likely explore the data per partner, partitioning it as
partner=A/year=X/month=Y/day=Z
will perform better than partitioning it as year=X/month=Y/day=Z/partner=A (see the sketch below).
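A minimal PySpark sketch of writing data with that partition layout; the bucket names and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-example").getOrCreate()

# Hypothetical raw dataset containing partner, year, month and day columns.
df = spark.read.parquet("s3://my-raw-bucket/daily-files/")  # assumed source path

# Writing with partner first means per-partner queries only scan one prefix.
(df.write
   .mode("append")
   .partitionBy("partner", "year", "month", "day")
   .parquet("s3://my-curated-bucket/partner-data/"))
```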
The data itself is not part of the Data Catalog.
The Glue Data Catalog can be used as an Apache Hive metastore, and an existing Hive metastore can be imported into the Data Catalog.
Crawlers
Crawlers are processes that scan structured and unstructured data and then use the gathered information to populate or update the Data Catalog.
Sources can be:
- S3
- RDS
- Redshift
- DynamoDB
- JDBC: Java Database Connectivity
- Glue Data Catalog
Crawlers can be scheduled, triggered on-demand or triggered as part of a Glue Workflow.
It is possible to create custom classifiers for crawling custom data types.
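As a sketch, a crawler over an S3 prefix can be created and started with boto3; the crawler name, IAM role, database, path, and schedule below are all hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical crawler that scans an S3 prefix and writes tables to "sales_db".
glue.create_crawler(
    Name="daily-sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # assumed IAM role
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-raw-bucket/sales/"}]},
    Schedule="cron(0 2 * * ? *)",  # optional: run every day at 02:00 UTC
)

glue.start_crawler(Name="daily-sales-crawler")  # on-demand run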
Glue Jobs
A fully managed process used for ETL. A job can be a pure Python script, a Scala script, or a PySpark script. Jobs can be event-driven and can perform ETL on streaming data by using continuously running jobs. In some circumstances, it is possible to update the catalog directly from the Glue Job, without requiring a crawler to be re-executed.
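A minimal PySpark Glue job sketch that reads a Catalog table and writes Parquet to S3; the database, table, column, and output path are hypothetical:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Data Catalog (hypothetical names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Trivial transformation: drop a column, then write Parquet back to S3.
dyf = dyf.drop_fields(["internal_notes"])
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-curated-bucket/orders/"},
    format="parquet",
)

job.commit()
```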
DPUs, or Data Processing Units, can be increased to raise the compute capability of each node. In addition, it is possible to add more nodes to the job, and it can be scaled on demand.
Jobs support encryption of data at rest and in transit.
Jobs can be scheduled or triggered by Glue Triggers.
Job Bookmark
Glue Jobs have an optional feature called Bookmarks. A bookmark persists the state of the job, preventing the same data from being processed again and allowing the next execution to start exactly where the last one ended.
In relational databases, job bookmarks can only keep track of new rows. They cannot keep track of updated rows.
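Bookmarks are enabled through the job argument --job-bookmark-option set to job-bookmark-enable, and each source read needs a transformation_ctx so Glue can key the bookmark state to it. A rough sketch, reusing the job script above (names are hypothetical):

```python
# The transformation_ctx string is the key Glue uses to store bookmark state
# for this read; job.init()/job.commit() in the main script persist it.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
    transformation_ctx="orders_source",
)
```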
Workflows
A Glue-only orchestration component. If any non-Glue integration is required, such as a Lambda function, then a Glue Workflow cannot be used.
Workflows can be triggered on a schedule, on demand, or by EventBridge events, with optional batching available.
EventBridge is the only non-Glue component that offers integration with AWS Glue workflows.
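For example, a workflow can be started on demand with boto3; the workflow name is hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Kick off an existing workflow on demand; the returned run id can be used
# to poll the run status later.
run = glue.start_workflow_run(Name="nightly-etl-workflow")
print(run["RunId"])
```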
Glue Studio
Glue Studio is a visual interface for building complex ETL DAGs (Directed Acyclic Graphs, i.e. workflows).
Sources can be:
And possible targets are:
- S3 with partitioning support.
- Glue Data Catalog
The studio also provides a dashboard for job overview, status, etc.
Data Quality
It is possible to add a Data Quality step to a job/workflow to ensure the data meets some pre-established rules. In case of failure, you can fail the job or integrate with other services, like CloudWatch, to report the unexpected behaviour.
The rules used for the Data Quality check can be defined manually using DQDL (Data Quality Definition Language) or automatically, in which case Glue tries to find patterns in a sample of the data and creates a set of expected rules.
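A sketch of evaluating a DQDL ruleset inside a job, assuming the EvaluateDataQuality transform that Glue Studio-generated scripts use; the ruleset, frame, and context name are hypothetical:

```python
from awsgluedq.transforms import EvaluateDataQuality

# Hypothetical DQDL ruleset: order_id must never be null and the row count
# must stay above a threshold.
ruleset = """
Rules = [
    IsComplete "order_id",
    RowCount > 1000
]
"""

dq_results = EvaluateDataQuality.apply(
    frame=dyf,  # a DynamicFrame produced earlier in the job (hypothetical)
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "orders_dq_check"},
)
```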
Glue Streaming
Glue Streaming enables you to process streaming data in near real-time in a serverless way, with autoscaling, using continuously running jobs reading from the following data sources:
- Kinesis
- Amazon MSK (Managed Streaming for Apache Kafka)
- Self-managed Apache Kafka
DynamoDB Streams is not a supported source.
Use cases include:
- Near-real-time data processing
- Fraud detection
- Social media analytics
- Internet of Things (IoT) analytics
- Clickstream analysis
- Log monitoring and analysis
- Recommendation systems
Glue Streaming uses checkpoints to track what has already been processed; Job Bookmarks are not compatible with streaming jobs.
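A rough sketch of a streaming job reading a Kinesis-backed Catalog table and processing micro-batches; the database, table, window size, and paths are hypothetical:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Streaming read from a Data Catalog table backed by a Kinesis stream.
kinesis_frame = glue_context.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="clickstream",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

def process_batch(data_frame, batch_id):
    # Each micro-batch arrives as a Spark DataFrame; write it out to S3.
    if data_frame.count() > 0:
        data_frame.write.mode("append").parquet("s3://my-curated-bucket/clicks/")

# Checkpoints stored under checkpointLocation track what has been consumed.
glue_context.forEachBatch(
    frame=kinesis_frame,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://my-glue-temp-bucket/checkpoints/clickstream/",
    },
)
```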
Accessing data via VPC
If a Job needs access to a VPC, for example to reach a JDBC data store, Glue will create and attach an Elastic Network Interface (ENI) to the job. The ENI only gets a private IP address; there is no public IP address, meaning the job cannot access the public internet. Security Group configurations are also applied to the ENI.
If a Job needs access to the public internet, it is required to create a NAT Gateway in the VPC.
It is possible to use a Gateway VPC Endpoint for S3 to ensure data does not go through the public internet.
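The VPC details come from the Glue connection attached to the job. A hedged boto3 sketch of a JDBC connection that pins the job's ENI to a specific subnet and security group; all identifiers and credentials are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# The subnet and security group here determine where the job's ENI is placed.
glue.create_connection(
    ConnectionInput={
        "Name": "orders-postgres-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://db.internal:5432/orders",
            "USERNAME": "glue_user",
            "PASSWORD": "replace-me",  # prefer Secrets Manager in practice
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```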
Cross-Job Data Access
If two jobs are running in the same VPC, they may have access to each other's data. To prevent this, a different security configuration must be set for each job.
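A hedged boto3 sketch of creating a distinct security configuration for one of the jobs; the configuration name and KMS key ARN are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Each job gets its own security configuration so its data is encrypted with
# a job-specific setup rather than a shared default.
glue.create_security_configuration(
    Name="job-a-security-config",
    EncryptionConfiguration={
        "S3Encryption": [
            {
                "S3EncryptionMode": "SSE-KMS",
                "KmsKeyArn": "arn:aws:kms:us-east-1:123456789012:key/abcd-ef01",
            }
        ],
        "CloudWatchEncryption": {"CloudWatchEncryptionMode": "DISABLED"},
        "JobBookmarksEncryption": {"JobBookmarksEncryptionMode": "DISABLED"},
    },
)
```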
Costs
Generally speaking, you pay for what you use.
For development endpoints, however, AWS charges by the minute while the endpoint is provisioned, so be sure to delete the endpoint whenever it is not in use.