Dude, bangin' tuples...

  • Writer: Whitney Wilger
  • Mar 24
  • 2 min read

Updated: Mar 31

Okay, okay, that's not what dbt stands for. Data Build Tool, or dbt for short, is all the rage in the ELT world right now, and for good reason. Once upon a time, when ETL was first born and cloud storage was expensive as all get out, we performed transformations before the data ever reached the warehouse, likely with Spark or some other robust data engine. Now, with cloud storage accessible to almost any business worth its salt, transformations within the data warehouse are the way.


Enter dbt, a tool that places development in the hands of anyone who can write SQL. For the more advanced engineers, you can create macros to repurpose code, handle multi-tenancy, write out to locations different from your source data (within reason), and create models to your heart's content.
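To make the macro idea concrete, here's a minimal sketch (the macro and model names are my own hypothetical examples): define a tiny bit of logic once, then call it from any model instead of copy-pasting the math.

```sql
-- macros/cents_to_dollars.sql
-- A small reusable macro: define once, call from any model.
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

```sql
-- models/clean/clean_orders.sql
-- Reuses the macro rather than repeating the conversion everywhere.
select
    order_id,
    {{ cents_to_dollars('amount_cents') }} as amount_dollars
from {{ ref('ingest_orders') }}
```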


Models, the foundational blocks of dbt, are organized into what are called data staging layers: basically organizational folders. At the most basic level you can have a single data staging layer (not ideal), or two data staging layers, one for staging and one for production. As for me, I like to Marie Kondo my work, so I have six data staging layers: Ingest, Clean, Normalize, Integrate, Analyze, and Egest. I won't belabor what goes in each layer, as I think the names are fairly self-explanatory, and yes, you can put denormalized models in the Normalize layer.
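If you're wondering how those layers get wired up, here's one rough way to do it in dbt_project.yml, assuming one folder per layer, each landing in its own schema (the project name and schema names are just my convention, not gospel):

```yaml
# dbt_project.yml (excerpt) -- one folder per layer, one schema per layer
models:
  my_project:
    ingest:
      +schema: ingest
    clean:
      +schema: clean
    normalize:
      +schema: normalize
    integrate:
      +schema: integrate
    analyze:
      +schema: analyze
    egest:
      +schema: egest
```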


Organizing your models really means something when you start having upwards of 50, 75, 100...models in your project. Not only that, it makes your code more portable and easier to onboard new developers onto. No one wants to dissect your 4000-line stored procedure just to understand what is happening in a single column. Breaking models down into logical steps means that new developers can pick up where you left off, make changes while lessening the blast radius, and test low-level changes without impacting clients, who only see a view based off of your Egest layer models.
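As a sketch of what "logical steps" looks like in practice (model and column names are hypothetical), each model does one small job and builds on the previous layer via ref(), so dbt knows the dependency order for you:

```sql
-- models/normalize/normalize_customers.sql
-- One small step: standardize customer fields from the Clean layer.
select
    customer_id,
    lower(trim(email)) as email,
    initcap(trim(first_name)) as first_name
from {{ ref('clean_customers') }}
```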


That's right, I don't let clients see anything in the data staging layers. Exposing your content via a view of the final model or models protects you somewhat like a semantic layer would: as long as you keep the column names the same, they'll be none the wiser. This comes in handy when you need to alter a calculation or algorithm, or simply correct a data mishap.
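In dbt terms, that final exposure might look something like this (again, model and column names are hypothetical): an Egest model materialized as a view, selecting only the columns clients should ever see.

```sql
-- models/egest/service_delivery_metrics.sql
-- Clients query this view; everything behind it can change freely
-- as long as these column names stay stable.
{{ config(materialized='view') }}

select
    metric_date,
    tenant_name,
    tickets_resolved,
    avg_resolution_hours
from {{ ref('analyze_service_metrics') }}
```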


In the diagram above we are using macros to access multiple tenant databases, combining them into a single repository, and reporting on them as part of our Service Delivery Metrics. Jenkins is my orchestration tool of choice, mainly because it was available to me, and as a dbt Core user I need something that lets me schedule jobs, create dependencies, and integrate easily with GitHub Actions. Jenkins is not everyone's cup of tea, but I love it. Snowflake is our data warehouse of choice; it handles the dbt models with ease and speed. Whether we scale vertically or horizontally, Snowflake bangs out the work without fail. See what I did there? Bang, bangin'. Stay tuned for another blog where I break down the ins and outs of multi-tenancy in dbt.
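Until that post lands, here's the flavor of the multi-tenant trick as a simplified, hypothetical sketch (the tenant list comes from a project variable, and the raw schema name is just for illustration): one macro loops over tenant databases and unions the same table from each, tagging every row with its tenant of origin.

```sql
-- macros/union_tenants.sql
-- Union the same table across all tenant databases,
-- labeling each row with the tenant it came from.
{% macro union_tenants(table_name) %}
    {% for db in var('tenant_databases', []) %}
    select '{{ db }}' as tenant, *
    from {{ db }}.raw.{{ table_name }}
    {% if not loop.last %}union all{% endif %}
    {% endfor %}
{% endmacro %}
```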
