Today Whit dabbles in AWS Glue. If you recall from our last blog, my team is working on creating a data lake, and our first stab [single POC source] is the above flow for a multi-tenant application store. One of the major hurdles we encountered, though we didn't entirely expect it straight off, was performance tuning in AWS Glue. In talking to AWS experts, we learned that they expect processing time to be relatively linear with respect to the amount of data you are processing. Of course there are exceptions for small batches due to the amount of overhead in running a Glue job, but if you run into lengthy executions, it is time to start digging.
If you aren't familiar, Glue is a fully-managed data movement service, backed by Spark (we are using Apache Spark type jobs, so any reference to a job will be of that type) and hosted by AWS. There is the concept of a crawler, which catalogs your schema; a job, which does the actual data movement; and a workflow to coordinate it all (previously I used Step Functions for this, but checking for completion before moving to the next step can involve a lot of Lambda executions--this is much easier and more cost effective). Glue jobs allow you to write the code in PySpark or Scala, the former being our flavor of choice, and there is a lot of tuning within your scripts to make things faster.
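As a quick illustration of why the workflow route is simpler, here is a minimal boto3 sketch of kicking one off and checking on it; the workflow name is a placeholder for whatever you have defined.

import boto3

glue = boto3.client("glue")

# One call starts the whole workflow; Glue handles the crawler-then-job ordering,
# so there is no Lambda polling loop like there was with Step Functions.
run_id = glue.start_workflow_run(Name="datalake-poc-workflow")["RunId"]

# Check on progress whenever you like, without paying for idle waits.
run = glue.get_workflow_run(Name="datalake-poc-workflow", RunId=run_id)["Run"]
print(run["Status"])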
Glue uses a concept called DPU, or Data Processing Unit, which, depending on the type of Worker you choose and the Glue version you are running on, grants you a certain amount of memory and a certain number of CPUs. You can think of DPUs like nodes in other systems. There will always be a leader node and executor node(s), hence the reason the minimum DPU count is two. In our case, we are using 12 DPU, and in the image below you will see 11 executors and 1 driver, or leader node.
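To make that sizing concrete, here is a rough boto3 sketch of how such a job definition could look; the job name, IAM role, and script location are placeholders, and the exact DPU-per-worker math depends on the worker type and Glue version you choose.

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="incremental-load-job",      # hypothetical job name
    Role="MyGlueServiceRole",         # hypothetical IAM role
    GlueVersion="2.0",
    WorkerType="G.1X",                # each G.1X worker corresponds to 1 DPU
    NumberOfWorkers=12,               # shows up as 1 driver + 11 executors
    Command={
        "Name": "glueetl",            # Apache Spark job type
        "ScriptLocation": "s3://my-bucket/scripts/incremental_load.py",
        "PythonVersion": "3",
    },
)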
When we first attempted to run the Incremental job for a single table with 186GB and 12 DPU, we were met with 20+ hour run times (we killed it off before it had a chance to finish), and a bit of investigation showed that the majority of the processing time was spent creating the dynamic frame (an AWS variation on DataFrames, tuned for their services) and writing the dynamic frame back out to S3. This was because we weren't taking advantage of Glue's parallel processing via hash fields/expressions and hash partitions.
The hash field OR hash expression (never both) is the field you wish to base your parallel processing upon. You would use a hash field if there is a high-cardinality field in your dataset, or you can defer to a hash expression, which will base the parallelization upon the rows of a numeric-type field. The hash partition is the number of breakdowns you wish to achieve--this is similar to subprocess-based multiprocessing or the thread-based threading module in Python. In our case we needed to process multiple tables that did not necessarily have a good, high-cardinality field, but they all do have a numeric identifier, or ID, field.
# Base the parallel read on the numeric ID column, split across 10 partitions.
self.hashexpression = "id"
self.hashpartitions = "10"

# Only pass along the options that are actually set.
additional_options = {}
if self.hashexpression:
    additional_options["hashexpression"] = self.hashexpression
if self.hashpartitions:
    additional_options["hashpartitions"] = self.hashpartitions

# The options are handed straight to create_dynamic_frame.
datasource0 = self.glue_context.create_dynamic_frame.from_catalog(
    database=self.source.database,
    table_name=self.source.table_name,
    catalog_id=self.source.catalog_id,
    additional_options=additional_options,
    transformation_ctx="datasource0_" + self.target.table_name)
As you can see, setting these values is as simple as adding additional options to your create_dynamic_frame call (the same goes for the write_dynamic_frame function).
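For the write side, here is a sketch of what passing options to write_dynamic_frame can look like when landing data in S3; the frame variable, bucket path, and partition keys are placeholders for your own dataset.

datasink0 = self.glue_context.write_dynamic_frame.from_options(
    frame=transformed_frame,          # hypothetical dynamic frame produced upstream
    connection_type="s3",
    connection_options={
        "path": "s3://my-datalake-bucket/incremental/" + self.target.table_name,
        "partitionKeys": ["tenant_id"],   # hypothetical partition column
    },
    format="parquet",
    transformation_ctx="datasink0_" + self.target.table_name,
)

Because the dynamic frame arrives already split across the hash partitions from the read, the write side tends to fan out across those partitions as well.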
These hash options are a guaranteed speed-up for your reads and writes. We have more than halved our run times just by implementing this single feature. A word of warning: be cautious about relying strictly on the documentation, as it doesn't necessarily showcase these additional options, especially for write_dynamic_frame. I often find that searching Google, checking Stack Overflow, and hitting up the AWS documentation gives me a broader basis for my knowledge, and we can always glean insight from those who came before us.
Cheers, thanks for reading, and HAPPY CODING!