How dbt can help you become a better data analyst

Data Analytics Professional 📈 | Love scuba diving🤿 and always ready to explore the beauty of islands 🏝️

Introduction

Building the best cloud-native property management system (PMS) isn’t easy, and the analytics surrounding it are no different. Addressing product-related questions from our product managers, designers, and sometimes even developers has always been challenging, but that’s where dbt comes to the rescue.

The modern data stack has revolutionized how data analysts work within an organization, and dbt has played a significant role in this transformation. Today, data analysts can apply software engineering best practices to their analytical assets, such as version control, quality assurance, documentation, and modularity. 

At Mews, dbt is a part of our data stack that helps data analysts and engineers build and maintain their analytics pipelines. After reading this article, you will understand how dbt can serve as a valuable tool and how you can benefit from it as a data analyst. 

Photo: Ufuk Ceyhanli via Midjourney

Log processing – Incremental models 

Having a substantial number of logs is almost inevitable with a digital product and a growing user base. Working with log datasets is essential for extracting valuable insights, but this can be bothersome since these datasets are often heavy and not readily analyzable, requiring additional transformations. Typically, only a fraction of the log data relates to your product area, so some filtering will also be necessary. 

Photo: Ufuk Ceyhanli

Using the table materialization option in dbt, you can materialize your product area’s records into a new table and run your analysis on that table rather than on the raw logs. Since the data is precomputed and stored, your queries will run faster than re-executing the entire transformation pipeline every time you need the data. It should also significantly reduce dashboard refresh times once you start using the new table as a data source instead of the main logs table.
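
As a minimal sketch (the raw logs source, its columns, and the billing_ event prefix are all hypothetical), such a model can be as simple as a filtered SELECT configured with the table materialization:

{{ config(materialized='table') }}

SELECT
  log_id,
  logged_at,
  user_id,
  event_name
FROM {{ source('raw', 'logs') }}
-- Keep only the events that belong to our product area
WHERE event_name LIKE 'billing_%'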

You should consider implementing this as an incremental model if the table creation takes a considerable amount of time. Incremental models only process new logs and add them to your table rather than recreating everything each time. 
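
Building on the sketch above, switching the config to the incremental materialization and adding an is_incremental() guard (both standard dbt constructs) means each run only processes logs newer than what the table already contains:

{{ config(materialized='incremental', unique_key='log_id') }}

SELECT
  log_id,
  logged_at,
  user_id,
  event_name
FROM {{ source('raw', 'logs') }}
WHERE event_name LIKE 'billing_%'
{% if is_incremental() %}
  -- On incremental runs, only pick up logs newer than the latest loaded timestamp
  AND logged_at > (SELECT MAX(logged_at) FROM {{ this }})
{% endif %}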

Keeping track of history – Snapshots 

Most of the dimensional tables in an operational database follow the Type 1 Slowly Changing Dimension (SCD) pattern: when data changes, the existing record is overwritten with the new values, and no history is kept. However, in certain cases, data analysts may need to track changes in records to support product decisions.

By using dbt’s snapshots feature, you can capture the state of a table at defined intervals and implement Type 2 SCDs. This provides insight into which records changed during specific time periods.
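
As a minimal sketch using dbt’s SQL snapshot block (the raw stores source and its updated_at column are assumptions), a file in the snapshots path could look like this; with the timestamp strategy, dbt compares updated_at to detect changed rows and maintains dbt_valid_from and dbt_valid_to columns for each version of a record:

{% snapshot stores_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}

SELECT * FROM {{ source('raw', 'stores') }}

{% endsnapshot %}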

Photo: Ufuk Ceyhanli

Enforcing data quality – Tests

While data analysts typically understand product needs and can translate them into SQL queries, the final artifact can sometimes differ from what was envisioned, leading to incorrect results and insights. dbt offers four out-of-the-box generic tests: unique, not_null, accepted_values, and relationships. Custom generic tests can also be implemented by creating a SQL file in the tests/generic path of your dbt project (a sketch follows at the end of this section).

Here is a complete example using those generic tests on a model named payments:

version: 2

models:
  - name: payments
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['accepted', 'declined', 'in process']
      - name: store_id
        tests:
          - relationships:
              to: ref('stores')
              field: id

Implementing tests with dbt helps you build the correct model from the start, minimizing errors. For example, the accepted_values test fails when a record contains a value not defined in the specified list for a column. The failure prompts you to notice a change in the business or engineering process, communicate with upstream engineers to validate it, and adapt your model if necessary.
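
As for custom generic tests, here is a minimal sketch of such a file (the is_positive test and the rule it enforces are hypothetical); a generic test fails for every row its query returns:

{% test is_positive(model, column_name) %}

-- Fails for any row where the tested column is zero or negative
SELECT *
FROM {{ model }}
WHERE {{ column_name }} <= 0

{% endtest %}

Once defined, it can be attached to any column in the YAML file above exactly like the built-in tests, simply by adding - is_positive under that column’s tests.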

Saving time – Macros

When working as a data analyst within an organization, you may quickly notice that very similar transformation steps are applied to datasets across the organization. The simplest examples could be extracting the domain from users’ email addresses or converting one currency to another. Macros in dbt help you build purpose-specific pieces of code that others can reuse, saving time.

Let’s look at an example of a macro implementation.

Before introducing a macro, a dbt model that reads the users table and adds an email domain column by transforming email addresses looks like this:

WITH users AS (
  SELECT
    user_id,
    email
  FROM {{ source('raw', 'users') }}
)
SELECT
  user_id,
  email,
  CASE
    -- Everything after the '@' is the domain; rows without an '@' yield NULL
    WHEN POSITION('@' IN email) > 0 THEN SPLIT_PART(email, '@', 2)
  END AS email_domain
FROM users

To reuse this logic, a macro that extracts the domain from an email address, called get_domain_from_email, is created in the macros path of your dbt project:

{% macro get_domain_from_email(email_column) %}
  CASE
    WHEN POSITION('@' IN {{ email_column }}) > 0 THEN SPLIT_PART({{ email_column }}, '@', 2)
  END
{% endmacro %}

The model will look like this when the macro is used in the dbt model:

WITH users AS (
  SELECT
    user_id,
    email
  FROM {{ source('raw', 'users') }}
)
SELECT
  user_id,
  email,
  {{ get_domain_from_email('email') }} AS email_domain
FROM users

Once the macro is created, it can be used by any data analyst whenever they need to extract domain information from email addresses. This results in significant time savings, as analysts no longer need to write the code from scratch.

Conclusion

In this article, you have learned how dbt can be a valuable tool for building analysis-ready tables for your product area, enhancing query performance, catching data quality issues with tests, and avoiding repetition with macros.

In addition to the benefits mentioned above, utilizing dbt in your data stack can also improve collaboration between data analysts and engineers, streamline the development process of analytical pipelines, and enable easier maintenance and scalability of your data infrastructure. By leveraging the power of dbt, organizations can unlock the full potential of their data and make more informed product decisions.
