It's been a hot minute since my last entry about AWS Glue, and our project is moving along, almost fully in our production environment (the last step is turning on the movement into Snowflake, so stay tuned)! It's been a long journey, but in our defense, it was myself and two contractors (who eventually moved on), then just me, and then myself plus half the time of another developer. I often wonder if we are one of the smaller lake-building teams--if you're reading this and doing similar work, feel free to drop a comment!
Okay, so on to the goods...
Bookmarks: Bookmarks are good and bad. The good: you don't have to manage the entries. The bad: you can't manage the entries. Let me explain: let's say you are building your lake long after the database or data store was established and there is considerable data within. That first bookmark run, in terms of the SQL, is a less-than where clause, e.g.
where row_modified_at < '2022-04-10 00:00:00'
This can be hugely problematic, or not, depending on the size of your data below that point. In addition, if you only want to process data from some date forward, you don't have that choice because, as of this writing, you cannot set a bookmark; you can only retrieve them to see what they are currently. After running into this issue, we decided to use DynamoDB to store our bookmark entries. For jobs and tenant databases that had already been run at this point, we grabbed the bookmark entries (they are stored as JSON) and parsed them into one entry per tenant database, per table, with the from and to timestamps. We did this with a simple Python script--very straightforward if you need to do the same thing. For jobs and tenant databases that had not been run up to this point, we wrote our job scripts to use '1970-01-01 00:00:00' if there is no entry in DynamoDB--really, you could use any timestamp you choose, or the base of your desired historical run. Remember to disable bookmarks if you do choose to manage them on your own.
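If you want to do something similar, the fallback lookup is only a few lines. Here's a minimal sketch, assuming a hypothetical DynamoDB table called glue_bookmarks keyed by tenant database and table name (the table name and attributes are illustrative, not our exact schema):

import boto3

dynamodb = boto3.resource("dynamodb")
bookmark_table = dynamodb.Table("glue_bookmarks")  # hypothetical table name

def get_bookmark(tenant_db, table_name):
    # Look up the last processed timestamp for this tenant database and table.
    response = bookmark_table.get_item(Key={"tenant_db": tenant_db, "table_name": table_name})
    item = response.get("Item")
    # No entry yet? Fall back to the epoch (or whatever base timestamp you
    # want your historical run to start from).
    return item["to_timestamp"] if item else "1970-01-01 00:00:00"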
Cleanup: If your database has been around for a while and uses row-modified timestamps maintained by the server, you know that cleanup can leave you with huge chunks of data within just a small window of time. If you are using those timestamps for your batches, and those batches are based on one larger table (this is what we chose to do), you can run into issues. We went through a few iterations where we tried re-batching based on tables with more data in the window than the batching table; however, this was just as problematic. We finally settled on an outer batch, managed by a Python orchestration script, and then, within our job, we further batch by a database identifier--see the sketch below. This might mean your jobs will run longer, but it guarantees that your database can handle the queries being thrown at it and won't take forever to return. We also implemented this in our differential, daily jobs, on the off chance someone dumps a huge amount of data on any given day. So far it is working great.
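As a rough illustration of that two-level batching, here is a hedged sketch; the window size, job name, and argument names are hypothetical stand-ins for our actual orchestration code:

import boto3
from datetime import timedelta

glue = boto3.client("glue")

def run_glue_job(tenant_db, batch_start, batch_end):
    # Kick off a generic Glue job for one tenant database and one time window.
    glue.start_job_run(
        JobName="historical-load",  # hypothetical job name
        Arguments={
            "--tenant_db": tenant_db,
            "--begin_ts": batch_start.strftime("%Y-%m-%d %H:%M:%S"),
            "--end_ts": batch_end.strftime("%Y-%m-%d %H:%M:%S"),
        },
    )

def run_historical_batches(start, end, tenant_dbs, window=timedelta(days=7)):
    # Outer batch: walk the row-modified window in fixed-size chunks so no
    # single query asks the source for too much data at once.
    batch_start = start
    while batch_start < end:
        batch_end = min(batch_start + window, end)
        # Inner batch: within each window, process one tenant database at a time.
        for tenant_db in tenant_dbs:
            run_glue_job(tenant_db, batch_start, batch_end)
        batch_start = batch_end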
Historical Batching: As stated above, historical batching is tricky. You can't just tell a database with terabytes of data to give you everything--that's a recipe for a Sys Admin or DBA at your desk faster than you can... you get the idea. There are many ways to batch: by identifier, by some timestamp, etc. Knowing your data is critical in choosing which path to take, and sometimes it takes both. Explain plans are your friend; testing queries for run times and trying the methodology Glue uses for hash partitions are key. Technical debt complicates things further, but there always seems to be a way around it if you think outside the box.
Hash Partitions: Hash partitions are a way for Glue to break up the queries it sends to the data source. There are a few different ways this can be done, namely on an identifier or other numeric field, or on a date field. Depending on your data structure, you can use either, but the resultant queries will look something like this (keep in mind I am working from a MySQL source):
SELECT * FROM (select * from comment WHERE id % 3 = 0) as comment
SELECT * FROM (select * from comment WHERE id % 3 = 1) as comment
SELECT * FROM (select * from comment WHERE id % 3 = 2) as comment
What this is doing is splitting your query up by the hash field, id in this case, into the number of partitions (the value after the percent sign), with each split query assigned one remainder (the value after the equals sign). This can also be impacted by a 'slimdown query' if you are reading via JDBC, a la:
self.glue_context.read.jdbc(url=url_case, table=sql_case)
where sql_case is some select statement. I can elaborate here further if need be; just drop a comment down below. For complete transparency, I did run into a case where the performance of hash partitions actually degraded my query processing. Basically, the server's CPU was overrun by the queries being split into 20 parts versus just running them as a single query, i.e. a hash partition count of one. Keep that in mind while you are working with Glue: tweak anything and everything to test performance.
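To make that concrete, here is a rough Spark-level approximation of the pattern using pyspark's predicates option (assuming a Spark session named spark; the URL, credentials, columns, and partition count are placeholders, and this illustrates the idea rather than Glue's internal implementation):

# The 'slimdown' subquery: only the columns and rows we care about leave MySQL.
sql_case = "(select id, body, row_modified_at from comment where row_modified_at >= '2022-04-10 00:00:00') as comment"

num_partitions = 3  # tune this; more is not always faster (see the CPU note above)
predicates = [f"id % {num_partitions} = {i}" for i in range(num_partitions)]

df = spark.read.jdbc(
    url="jdbc:mysql://example-host:3306/tenant_db",  # placeholder URL
    table=sql_case,
    predicates=predicates,  # one query (and one Spark partition) per predicate
    properties={"user": "reader", "password": "***", "driver": "com.mysql.cj.jdbc.Driver"},
)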
Lazy Data Processing: Spark (the engine on which Glue runs) is lazy. Let's say you are working along in your code but you don't do anything that would necessitate bringing the data in--Spark isn't going to touch that data until you do. For example, you might create your DynamicFrame or DataFrame, but until you call something like count() or coalesce(), Spark won't bother bringing that data into memory. This might confuse you when looking at your logs--the dreaded red herring. Glue can't always keep up with the print outs and might be behind where the actual processing is--sometimes you have to dig deeper and set up the Spark console to see where it is truly hung up. Also keep in mind, those count() type calls are costly--reserve them for only when absolutely needed.
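A tiny illustration of that laziness (the connection details and column names are just placeholders, and spark is the underlying Spark session):

jdbc_url = "jdbc:mysql://example-host:3306/tenant_db"  # placeholder
conn_props = {"user": "reader", "password": "***", "driver": "com.mysql.cj.jdbc.Driver"}

# Nothing hits the database yet: these lines only define a lazy plan.
df = spark.read.jdbc(url=jdbc_url, table="comment", properties=conn_props)
recent = df.filter(df.row_modified_at >= "2022-04-10 00:00:00")

# Only an action like count() (or a write) forces Spark to pull the data and
# execute the plan--and actions like this are not free, so use them sparingly.
print(recent.count())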
Alternate Runtimes: Depending on whether your source is multi-tenant, or you simply have different needs for different database objects, you might have to run a few different workflows to get the job done. Don't feel like you have to create more jobs; in fact, you can create a generic workflow template and pass in parameters that simply point to existing jobs, assuming you have those jobs set up to meet both needs, i.e. table lists, begin and end timestamps, etc. This is extremely useful for us, as we had many databases to move and many clients with different reporting needs. You can see the architecture for this methodology in the diagram above.
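Inside the job script itself, the parameter-driven piece looks roughly like this (the argument names are hypothetical; the point is that one generic script changes its behavior based on whatever the workflow passes in):

import sys
from awsglue.utils import getResolvedOptions

# Hypothetical argument names passed in by the workflow / job run.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "table_list", "begin_ts", "end_ts"])

tables = args["table_list"].split(",")
window = (args["begin_ts"], args["end_ts"])
# ...the rest of the job then loops over tables, pulling only rows within window.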
Phew, and that's probably not even all of it, but that is the best I can do on a lazy Sunday morning. If you are just starting a Data Lake build on AWS, consider Glue as an option. It does make management of the jobs fairly easy since there is a nice console, but keep in mind you might need to get crafty if you have a lot of data to move, or even if you just want to future-proof your system. Happy coding, y'all, and as always, reach out to me on LinkedIn if you want to hash ideas--I am always happy to banter!