How to effortlessly unlock the data in your data lake

Head of Data Science @MewsSystems; Former PhD student in CS Machine Learning @Matfyz #dataengineering #learningfromdata #machinelearning #javascript #python

Do you understand the usage of your products and how new features are being adopted? Can your product teams really make data-driven decisions? Does your team know how users are using or even misusing your product? The goal is to give our product teams the right set of tools to understand the impact of their product delivery and to navigate all their product opportunities.

Having a lot of unstructured data lying around in your blob storages or S3 buckets might give you a false sense of control over the data and the insights you can derive from it. In practice, any data analyst given access to that alone would quickly be driven crazy. So how do we get from a pile of JSON and Parquet files to delivering value to our data consumers?

In this post, I will share with you a high-level overview of how you can unlock the data for your teams using Databricks and Lakehouse. We will deep dive into each topic in future blog posts. Now, sit tight and let me break down the five straightforward steps we took to integrate Databricks as our core data solution tool.

Why Databricks?

Earlier in 2021, we explored several candidate data solutions, and after building a few proofs of concept, we opted for Databricks. What we really enjoyed about the platform, and ultimately why we decided on it, is its code-first approach and the fact that we can build it around our engineering processes with GitHub, giving us continuous integration, continuous deployment, and code reviews.

Additionally, since we’re already in the Azure cloud, having Azure Databricks as a managed solution where we can use our premium support from Microsoft is a valuable benefit, alongside direct integration with Azure Active Directory.

Finally, the speed at which Databricks releases innovative new features, each of which is individually worth celebrating, is a feat in and of itself. During our research, we already had the opportunity to test Databricks SQL, use multi-task jobs, and try out Delta Live Tables – and we’re really looking forward to the release of Unity Catalog along with serverless SQL endpoints.

All in all, Databricks comes as a nice package with all the tools you need and best practices for how to use them.

How to unlock data with Databricks?

You can unlock your data using the following five steps:

  1. Make Delta the default format for your data
  2. Automate repetitive cleaning tasks
  3. Provide a seamless onboarding process with Azure Active Directory
  4. Get the most out of Databricks SQL
  5. Profit
Source: https://delta.io/

1. Make Delta the default format for your data

The key component that brought us to consider Databricks as the primary solution for our data transformation and processing is the Delta format. It gives you the reliability and robustness you would typically miss when operating asynchronously over an abundance of files in your data lake.

Here are some of the aspects of Delta we enjoy the most, just to name a few (a short sketch follows the list):

  • ACID (atomicity, consistency, isolation, durability) transactions at the table level,
  • Time travel,
  • Scalable metadata handling.
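
To make these features concrete, here is a minimal PySpark sketch, assuming a Databricks notebook where `spark` is already defined; the `raw.reservation_events` table name and its schema are hypothetical:

```python
# A minimal sketch of the Delta features above; table name and schema are hypothetical.
events = spark.createDataFrame(
    [("res-1", "Created"), ("res-2", "Canceled")],
    ["ReservationId", "State"],
)

# ACID write: the commit is atomic, so readers never see a half-written table.
events.write.format("delta").mode("overwrite").saveAsTable("raw.reservation_events")

# Time travel: read the table as it looked at an earlier version.
previous = spark.sql("SELECT * FROM raw.reservation_events VERSION AS OF 0")

# Metadata: the Delta transaction log records every operation on the table.
spark.sql("DESCRIBE HISTORY raw.reservation_events").show(truncate=False)
```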

2. Automate repetitive cleaning tasks

By automating repetitive data cleaning tasks, we can deliver a great data consumer experience with minimal effort. One of the data sources for Databricks is our Mews cloud system, from which we both synchronize data from the SQL database and process the logs. Looking at the raw data alone, consumers would need to work with enum types or flags represented as integer values, parse serialized duration values, or convert UTC dates to the hotel’s time zone.

To streamline this, we have built a data pipeline that extracts the necessary metadata from our system’s .NET assemblies and uses it together with the original data in the cleaning pipelines to produce nicely readable data.
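
As an illustration, here is a simplified sketch of such a cleaning step. The enum mapping, table names, and column names are hypothetical; in the real pipeline the mapping comes from the extracted assembly metadata rather than being hard-coded:

```python
from pyspark.sql import functions as F

# Hypothetical enum mapping; in the real pipeline this comes from the extracted .NET metadata.
state_names = spark.createDataFrame(
    [(0, "Enquired"), (1, "Confirmed"), (2, "Canceled")],
    ["State", "StateName"],
)

cleaned = (
    spark.table("raw.reservations")
    .join(state_names, on="State", how="left")  # integer enum -> readable label
    .withColumn(
        "StartLocal",
        F.from_utc_timestamp("StartUtc", F.col("HotelTimeZone")),  # UTC -> hotel time zone
    )
)

cleaned.write.format("delta").mode("overwrite").saveAsTable("cleaned.reservations")
```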

3. Provide a seamless onboarding process with Azure Active Directory

When we started with Databricks, we had a small group of early adopters helping us test the solution and giving us initial feedback. We loved that the default authentication to Azure Databricks is done with Azure Active Directory, but Databricks has its own management of users and groups, and managing this manually is unnecessary overhead.

Fortunately, we could easily set up the provisioning of users and groups with SCIM (System for Cross-domain Identity Management) between Databricks and Azure Active Directory. That way, whenever we onboard a new member, they are automatically granted access to Databricks SQL. And since the same group structure exists in Databricks, we use it to configure access control and permissions across our databases.
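
With the synced groups in place, granting access becomes a one-liner per group. A hypothetical example using Databricks table access control (the group and database names are made up, and table access control is assumed to be enabled):

```python
# Hypothetical example: grant an AAD-synced group read access to a database.
# Assumes table access control is enabled; group and database names are made up.
spark.sql("GRANT USAGE ON DATABASE cleaned TO `product-analysts`")
spark.sql("GRANT SELECT ON DATABASE cleaned TO `product-analysts`")
```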

4. Get the most out of Databricks SQL

Finally, the key to unlocking the data is Databricks SQL. Databricks acquired Redash and has integrated it successfully. The solution gives you a SQL interface with the option to create and share queries, build visualizations, and combine it all into dashboards. On top of that, you can build alerts based on query results and have the notifications delivered to Slack, PagerDuty, or anywhere else via the webhook feature, e.g. a Zapier zap triggering our unified incident workflow.
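
For a flavour of what such queries look like, here is a hypothetical health-check query of the kind a team might pin to a dashboard or attach an alert to; it is shown here via spark.sql, and the table and column names are purely illustrative:

```python
# Hypothetical health-check query of the kind a team might pin to a Databricks SQL
# dashboard or attach an alert to; table and column names are illustrative.
failed_today = spark.sql("""
    SELECT COUNT(*) AS failed_integrations
    FROM cleaned.integration_events
    WHERE State = 'Failed'
      AND EventDate = current_date()
""")
failed_today.show()
```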

5. Profit

All that combined allowed our R&D teams to deep dive into data on their own and set up their own health-check dashboards with monitoring based on both performance and business data. Now, anyone with basic SQL skills, whether it’s a product manager or a team lead, can build their own reports. Because of this, our data analysts now have more time to focus on more complex tasks and analyses.

Summary

We are really excited about what Databricks is already enabling us to do and there is even more on the horizon that we have not touched on yet. The combination of Delta format, Azure environment and Databricks SQL is already incredibly powerful, with more functional updates coming soon such as Unity Catalog and serverless SQL endpoints.

We are continually building our data platform and have recently connected the Delta data sources through a Synapse SQL endpoint to Salesforce Analytics, which unfortunately supported neither Spark nor Databricks SQL endpoints. Yet it works well for our Power BI reports and, as of late, also Looker. All this gets us closer to having a single source of truth for our data and insights.

Try it! You can use an Azure trial together with 14 days of free Databricks units. To get around the trial’s cluster limitations, you can set up your Spark clusters as single-node clusters.
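
Purely as a sketch, a single-node cluster can be described with a Clusters API payload like the one below; the cluster name, runtime version, and VM size are just example values:

```python
# A sketch of a single-node cluster spec (Databricks Clusters API payload as a Python dict).
# Trial subscriptions have tight core quotas, so a driver-only cluster avoids worker nodes.
single_node_cluster = {
    "cluster_name": "trial-single-node",   # hypothetical name
    "spark_version": "10.4.x-scala2.12",   # any available LTS runtime works
    "node_type_id": "Standard_DS3_v2",     # example Azure VM size
    "num_workers": 0,                      # no workers: the driver does all the work
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```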

Or message me on Twitter or LinkedIn and let me know if you prefer Databricks or Snowflake, and why. 😊
