With success and fast-paced growth, there comes a point when companies serve vastly different customers. As your product gains more market share and accumulates more data, the need for personalized segmentation becomes obvious.
However, navigating the complex process, from selecting the right modeling methods to addressing data quality issues, can be daunting.
We have experienced this ride firsthand with two different use cases and are excited to share our key findings with you. This post delves into the good, the bad, and the ugly of implementing a segmentation project, and hopefully, makes your journey easier.
Disclaimer: This is not a technical tutorial, although we provide resources for them. Our focus is discussing and offering tips on the execution and organizational aspects of a segmentation data product.
Implementing your first clustering algorithm
We are two data analysts, Monika and Danica, who recently delved into our first segmentation project. It was a whirlwind of inspiring highs and challenging lows, and we were thoroughly amazed by the intricate challenges it presented at various stages. As we reflect on this process, we want to highlight the obstacles that tested our resilience (the bad, the ugly) and the wins that kept us going (the good).
In this collaborative blog post, we share our unique perspectives on unsupervised clustering, each of us approaching segmentation from a different angle relevant to SaaS companies:
- Customer segmentation: Allows you to tailor your product by identifying customer segments. It can improve everything from subscription pricing to marketing.
- User segmentation: Allows you to personalize user experience by identifying your customer’s users. It can improve everything from product feature development to UX.
If you are a data analyst who wants to better understand your product’s segments but lacks access to pre-labeled data, this post explores the benefits segmentation can provide and how to avoid some of the common pitfalls of unsupervised clustering.
Monika – Getting a knowledge boost
At every step, there was an incredible amount of knowledge to gain as I explored various approaches alongside their pros and cons. Early in the data collection stage, I learned new methods for gathering different data types, and each subsequent stage brought forth more knowledge. Implementing my chosen method wasn’t always easy, but of course, I wasn’t the first to do it, so there was a lot of information out there.
To start, a quick Google search for ‘XY 101’ (or, in the worst case, ‘XY for dummies’) provided some initial guidance. As I progressed, I found immense support in consulting my exceptional colleagues. Since watching others do their magic is how I learn best, I realized that YouTube videos with real examples work well for me too. And if needed, there are always endless solutions on Stack Overflow, Medium, and other online sources. Some resources that helped us get started included:
- Beyond One-Hot: An Exploration of Categorical Variables
- Using UMAP for Clustering & UMAP Dimension Reduction
- Compare 10 Unsupervised Clustering Algorithms – Iris
- Unsupervised Learning – Clustering Algorithms
While navigating this sea of information, it’s always a good idea to record your sources in a note-taking app, Google Docs, Notion, or a similar tool. Your future self will thank you in case you need to revisit them.
Danica – Leveraging your community
I have the immense privilege of being part of a team that embraces vulnerability. Discussing my imposter syndrome with my manager led to a memorable and important reminder: seeking help is the key to overcoming obstacles in a professional setting. Take advantage of the community you already have on your team to help solve those challenging roadblocks.
Outside of your company, there are many ways to become part of a community.
- Stack Overflow: The holy grail for community guidance.
- LinkedIn: Find great articles and posts on data science project struggles.
- Data science meetups: Talk with analysts in other companies and understand how they are solving similar challenges.
💡 Remember that even experts have a tough time with clustering. If you don’t believe me, here’s a discussion on the improvements that still need to be made to validation metrics, from this paper:
As stated by Jain and Dubes, “Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage”. More striking than the statement itself is the fact that it still holds true after 25 years, despite all the progress that has been made.
Monika – Information overflow
As mentioned above, there are countless approaches and guidance sources to follow. But each step presents so many methodologies, with new complexities and information at every turn. Eventually, all this can get overwhelming.
Imagine starting enthusiastically and keen on delivering results, only to realize the originally chosen model didn’t suit your dataset. Time lost; frustration gained. Simply put, not every model seamlessly aligns with a specific use case. So, what did I do?
The good old methodical resolution approach, where I should have started:
- Step one: Narrowing down my information channels. I chose a small number of articles as my main reliable information source, turning to others only if necessary.
- Step two: A thorough exploration of each approach, examining their features, pros and cons, and suitability – a sort of reality show for models. Armed with comprehensive insights, only then did I decide what I wanted to use and why. This way, it’s also much easier to justify the chosen strategy.
Danica – Data collection hurdles
My biggest struggle in the data collection process was iteratively curating a “good enough” dataset for modeling. The real-world complexity of data collection means not having one single data source (i.e., one table in your database), but multiple data sources spanning several departments.
The data from these departments required multiple levels of expertise, so I ended up with varying levels of data quality: straightforward fields (e.g., nationality) were reliable, while fields whose logic depended on verification from the relevant teams were not.
💡 It helped to have a list of these “dubious” columns that needed clarification and further improvements. For example, in my notes for this project, I had:
| Feature | Action needed | Notes |
| --- | --- | --- |
| Feature A | Clarification | Clarify the logic and/or the specific definition of the feature for myself and the end-user. |
| Feature B | Improvement | Specify: the current logic of the feature, the improvements that can be made, and any relevant links (e.g., a link to the Slack conversation discussing the improvement of said feature). |
Monika – Senseless results
By now, you’re probably wondering what the hardest part was. Picture this: you’ve invested countless hours and substantial effort in your research and carefully selected models for each stage of your hotel segmentation. But then reality hits: the results are far from the robust outcome you had anticipated – in fact, they are disappointingly weak.
As I dug through the final clusters, my gut told me something was off. I reminded myself that unsupervised learning is a challenge, but still, I had hoped for a better outcome. Okay, time for a fresh start. I revisited the research only to agree with my previous choices. Dissatisfaction and disappointment inevitably arose.
So, I simply took a step back, focusing on what felt wrong. Looking closer at the final clusters, I identified the one that made the least sense; it appeared to be clustered on a single metric. Digging deeper, I grew suspicious that this metric was incorrect. It couldn’t be; I had done the exploratory data analysis, right? Apparently, not thoroughly enough. As I delved into the data and its counting logic, I uncovered a mistake: when counting this metric, nulls were excluded, but 0’s were not (and now I knew they should have been). As simple as that.
Once corrected, the clusters fell into place, aligning with my expectations, and the world became a happy place again. The moral of this story is that even when the results of an overly complex topic are lackluster, the mistake can be surprisingly simple. Determination to catch the fault is what brings success.
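A mistake like this is easy to reproduce in a few lines. Here is a minimal sketch of the null-vs-zero trap in pandas (the series and column name are hypothetical stand-ins, not the actual metric from the project):

```python
import numpy as np
import pandas as pd

# Hypothetical metric: NaN means "no record at all",
# 0 means "a record exists but nothing happened".
s = pd.Series([4, 0, np.nan, 2, 0, np.nan])

# Buggy count: pandas' count() silently drops NaN but keeps the 0's,
# so zero-valued rows inflate the result.
buggy = s.count()  # -> 4

# Corrected count: for this metric, 0 should be excluded just like null.
fixed = s.replace(0, np.nan).count()  # -> 2
fixed_alt = (s > 0).sum()             # -> 2, an equivalent one-liner
```

If a cluster seems to form around one metric, a quick sanity check like this on that metric's counting logic is cheap insurance.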
Danica – Scaling big data
My initial dataset included millions of records. I spent too much time on my first approach, which involved comparing the performance of four different models across varying sample sizes and parameters. Essentially, this meant doing hyperparameter tuning and finding the right sample size in a single step, based on silhouette scores.
These models took a substantial amount of time to run, and when I ended up with the perfect model after all those patient hours of model running and tuning, I made a harsh discovery. The model that appeared to perform so well was actually overfitting the data, resulting in 100+ cluster groups.
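The tuning loop described above can be sketched roughly as follows. This is a minimal illustration with scikit-learn, assuming numeric features and using KMeans and a toy dataset as stand-ins for the actual models and data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy stand-in for a real dataset (assumption: numeric features only).
X, _ = make_blobs(n_samples=2000, centers=4, random_state=42)

# Score candidate cluster counts on a small sample first; tuning
# directly on millions of rows is what made the runs painfully slow.
rng = np.random.default_rng(42)
sample = X[rng.choice(len(X), size=500, replace=False)]

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(sample)
    scores[k] = silhouette_score(sample, labels)

best_k = max(scores, key=scores.get)
```

Note that a high silhouette score alone is not proof of a sensible model: without a cap on the number of clusters, this kind of search can still crown an overfit winner.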
💡 Here are some ideas I will definitely be implementing next time:
- Narrowing my data scope based on criteria chosen with my stakeholders/collaborators (e.g., one hotel, country, or sub-region).
- After EDA, build out the skeleton code of the process (transforming variables, hyperparameter tuning, etc.) and run it with a very small sample size solely for debugging purposes.
- More does not always mean better. This may be controversial, but you don’t have to stick to the 80/20 split if there are diminishing returns with time to run vs. improved scores.
- Always include thresholds/caps on the most important output attributes (in my case, rejecting models resulting in more than 15 cluster groups).
- Choose the right model AND validation metrics based on:
- The type of data (categorical, numerical)
- The volume of data (some models will not work for big data)
- The shape of your data (are your clusters globular or density-based?)
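The threshold/cap idea from the list above can be wired directly into the validation step. This is a hedged sketch, not the project's actual code: the cap of 15, the helper name, and the KMeans stand-in are all assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: a cap of 15 groups, agreed with stakeholders up front.
MAX_CLUSTERS = 15

# Toy stand-in for the real dataset.
X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)

def score_model(labels, X, max_clusters=MAX_CLUSTERS):
    """Return a silhouette score, or None if the clustering breaks the cap."""
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 = noise
    if n_clusters < 2 or n_clusters > max_clusters:
        return None  # overfit or degenerate: reject before comparing scores
    return silhouette_score(X, labels)

ok = score_model(KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X), X)
too_many = score_model(KMeans(n_clusters=30, n_init=10, random_state=0).fit_predict(X), X)
```

Here the 30-cluster model is rejected outright (`too_many` is `None`) instead of being allowed to compete on score, which is exactly the guardrail that would have caught the 100+ cluster result early.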
The journey from messy data to meaningful clusters can be quite an adventure, and the path isn’t always straightforward. Our first segmentation project in a professional environment has been full of enlightening “Aha!” moments and exasperated facepalms.
On the positive side, it strengthened our team’s support and knowledge base as we leveraged the power of our professional community at work. We also discovered how valuable YouTube can be as an always-available mentor.
On the challenging side, some key takeaways from this project included the age-old saying “bigger isn’t always better,” particularly when it comes to big data. It also underscored the importance of making well-informed decisions, challenging your assumptions, and maintaining integrity. But perhaps the most important thing to leave you with is this: Whether you find yourself in the “good,” “bad,” or “ugly” stage in development, remember that there is a vast community of data scientists and analysts behind you.