In December 2021, our tech team reached 100 people, consisting of backend engineers, frontend engineers, mobile engineers, QA engineers, IT specialists, data analysts, engineering managers, directors and many other roles. That’s almost one-third of Mews. Nine years ago, when I started at Mews, I could hardly imagine such a big team, let alone the product and technological surface that we’d be working on, maintaining and improving. One might rightfully ask why we need such a big team. Unlike most startups and scaleups, our product portfolio is quite broad considering the size of our company. We offer our primary product to hotels, currently around 2500, and their employees. Besides that, we build apps for hotel guests (B2B2C), and we have an open API and a marketplace with more than 500 technology companies integrated with us. On top of that, we facilitate payments between hotels and their customers, so we’re a fintech company to an extent as well.
Product improvements and new features get all the accolades and the spotlight. However, there’s a parallel track of technical work that most people are not aware of and that’s being done “behind the curtain”. It is as crucial as product improvements, especially from a longer-term perspective. In this article, I’d like to focus on this workstream, highlight some of the technical enhancements we made in 2021, and describe how we elevated the platform to make it more secure, reliable, scalable and modern. I’d also like to outline what’s ahead of us in 2022.
Our engineering organization is divided into two main divisions: product engineering and platform engineering. The distinction is straightforward: product engineering serves the standard external customers (hotels, guests, API partners…), while platform engineering primarily serves internal customers (teams and developers inside Mews) and indirectly influences external customers. When it comes to projects driven by technical needs, everybody contributes, both product and platform teams. Platform teams spend most of their time on technical initiatives; product teams have to balance investment into product work with technical tasks. Usually, the ratio is around 2:1, but it depends on the individual team. So, everybody in engineering is responsible for the health of our codebase and the platform, and we all work together to progress on our path towards technical excellence.
A year of work is a pretty long period, and I was surprised by how much we’ve done when I put together this article’s contents. There is an underlying theme, though. Many of the initiatives aimed to catch up with state-of-the-art technologies, the latest tools, and framework versions. Technology evolves so fast it is sometimes tough to keep pace, especially in a startup environment. However, I’m happy we’ve not only caught up but also become early adopters in a few areas.
The most awaited initiative among all backend engineers was the migration from .NET 4 to the latest version of .NET. We opted for an iterative approach to slice the project, deliver some value early in the year, and minimize risks. So we first migrated to .NET Core 3.1, then to .NET 5 and finally, in December, to .NET 6. The most significant outcomes were a general improvement in backend performance and access to new C# language features. Speaking of performance, throughout the year we more than halved the server P95 response time, thanks to the latest .NET and smaller improvements such as optimizing the handling of users with lots of accounts. Internally, we decreased the test execution time by 60%, making our developers more productive.
In terms of infrastructure, we’ve added Redis caches to our cloud stack, mainly to improve the resiliency of our public APIs. Moreover, we’ve developed a framework for cooperative timeout handling, so that we don’t overutilize our infrastructure unnecessarily and can give our partners guarantees to rely upon. Later in the year, we employed Azure Front Door to reduce latency worldwide because we have clients from California to Sydney. It also enables more advanced routing scenarios that Traffic Manager doesn’t support. Management of the infrastructure can get messy at scale. Therefore, we adopted Pulumi as our infrastructure-as-code solution, started using it for all new types of resources, and migrated many existing ones.
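As an illustration only, the idea behind cooperative timeout handling can be sketched in a few lines of TypeScript. This is a minimal sketch and not the actual Mews framework: the `Deadline` class, `TimeoutError` and `processAll` are hypothetical names. It assumes each request carries a time budget that the code checks between units of work, so processing stops once the caller would no longer wait for the result.

```typescript
// Hypothetical error type raised when the time budget is used up.
class TimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TimeoutError";
  }
}

// Hypothetical deadline object that is passed down the call chain.
class Deadline {
  private readonly expiresAt: number;
  private readonly now: () => number;

  // The clock is injectable so the behavior can be tested deterministically.
  constructor(budgetMs: number, now: () => number = Date.now) {
    this.now = now;
    this.expiresAt = now() + budgetMs;
  }

  remainingMs(): number {
    return Math.max(0, this.expiresAt - this.now());
  }

  throwIfExpired(): void {
    if (this.remainingMs() === 0) {
      throw new TimeoutError("request budget exhausted");
    }
  }
}

// Processes items one by one and checks the deadline before each step,
// instead of being interrupted from the outside at an arbitrary point.
function processAll<T, R>(items: T[], step: (item: T) => R, deadline: Deadline): R[] {
  const results: R[] = [];
  for (const item of items) {
    deadline.throwIfExpired();
    results.push(step(item));
  }
  return results;
}
```

The check is cooperative because the work itself decides where it is safe to stop, which avoids spending resources on a response nobody is waiting for.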
Users of our APIs are sometimes “creative”, so we need to make sure we support their use cases while guaranteeing the platform’s reliability. To isolate different workloads, we’ve split our backend into three categories and sets of instances: public API traffic, internal application traffic and WebSocket traffic. That allows us to scale optimally according to the workload while minimizing the impact of one type of traffic on the others. As part of that, we’ve finally migrated from our legacy system domain mews.li to our main domain mews.com. To make usage of our APIs easier, we’ve introduced Swagger, allowing partners to generate their SDKs instead of manually implementing them. Finally, we’ve established a privilege system on the API level that we’ll utilize to openly communicate to hotels what each integration can and cannot do.
It’s not only about incoming traffic, but also outgoing. Two-way integrations are the most efficient because they eliminate unnecessary polling. Therefore, we introduced webhooks as part of our main API contract. We invested a lot in the infrastructure in the channel management area (communication of price and availability to booking sites) due to the increasing number of direct connections and the growing volume of data. We’ve adopted a multi-queue model for the outgoing data that minimizes the computation needed, increases resiliency and, most importantly, decreases processing times. As a nice side effect, hotels now have complete visibility into this process, which has historically been a black box and a nightmare to investigate.
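The article does not describe how the multi-queue model works internally, so the following TypeScript sketch is only one possible reading of it. It assumes one queue per booking channel, where a newer update for the same key replaces a pending one. The `RateUpdate` shape and the `OutgoingQueues` class are hypothetical and not taken from the Mews codebase.

```typescript
// Hypothetical shape of an outgoing update; the field names are assumptions.
interface RateUpdate {
  channel: string; // booking site the update is destined for
  key: string;     // e.g. room type and date the update applies to
  price: number;
}

class OutgoingQueues {
  // One queue per channel; each queue keeps only the latest update per key.
  private readonly queues = new Map<string, Map<string, RateUpdate>>();

  enqueue(update: RateUpdate): void {
    let queue = this.queues.get(update.channel);
    if (queue === undefined) {
      queue = new Map<string, RateUpdate>();
      this.queues.set(update.channel, queue);
    }
    // A newer update for the same key replaces the pending one,
    // so superseded data is never computed or sent.
    queue.set(update.key, update);
  }

  // Drains a single channel. If send throws, the failed update and the rest
  // of this queue stay pending, and other channels are not affected.
  drain(channel: string, send: (update: RateUpdate) => void): number {
    const queue = this.queues.get(channel);
    if (queue === undefined) return 0;
    let sent = 0;
    for (const [key, update] of Array.from(queue)) {
      send(update);
      queue.delete(key); // removed only after a successful send
      sent += 1;
    }
    return sent;
  }

  pending(channel: string): number {
    return this.queues.get(channel)?.size ?? 0;
  }
}
```

Under these assumptions, separating the queues keeps a slow or failing channel from delaying the others, and coalescing by key reduces the amount of data that has to be processed.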
Functional programming is an inherent part of our engineering culture. At the beginning of the year, we unified our transactional layer (web, API, jobs) to a fully functional approach, with all transactions annotated by the owning team, data consistency requirements, and other attributes. On the other end of the stack, we’ve made our data entities more strongly typed, and we’re utilizing ADTs (algebraic data types) much more. In the payments area especially, we’ve adopted a functional approach to error handling, using a Result type instead of exceptions.
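To show what error handling with a Result type looks like in general, here is a minimal TypeScript sketch. It is not the Mews payments code, which lives in the C# backend. The `PaymentError` cases and the `charge` function are hypothetical and chosen only to illustrate how an algebraic data type makes every failure an explicit, typed value.

```typescript
// A Result is either a success carrying a value or a failure carrying an error.
type Result<T, E> =
  | { kind: "ok"; value: T }
  | { kind: "error"; error: E };

const success = <T>(value: T): Result<T, never> => ({ kind: "ok", value });
const failure = <E>(error: E): Result<never, E> => ({ kind: "error", error });

// Hypothetical payment errors modeled as an algebraic data type.
type PaymentError =
  | { type: "CardDeclined"; reason: string }
  | { type: "InsufficientFunds"; missingAmount: number }
  | { type: "GatewayUnavailable" };

// Returns the remaining balance, or a typed error instead of throwing.
function charge(balance: number, amount: number): Result<number, PaymentError> {
  if (amount <= 0) {
    return failure<PaymentError>({ type: "CardDeclined", reason: "invalid amount" });
  }
  if (amount > balance) {
    return failure<PaymentError>({ type: "InsufficientFunds", missingAmount: amount - balance });
  }
  return success(balance - amount);
}

// The compiler checks that every error case is handled.
function describeOutcome(result: Result<number, PaymentError>): string {
  if (result.kind === "ok") {
    return `charged, remaining balance ${result.value}`;
  }
  const error = result.error;
  switch (error.type) {
    case "CardDeclined":
      return `declined: ${error.reason}`;
    case "InsufficientFunds":
      return `insufficient funds, missing ${error.missingAmount}`;
    case "GatewayUnavailable":
      return "gateway unavailable, retry later";
  }
}

console.log(describeOutcome(charge(50, 80))); // prints "insufficient funds, missing 30"
```

Compared with exceptions, the possible failures are part of the function signature, so a caller cannot forget to handle them without the type checker noticing.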
We’ve chosen New Relic as our primary provider of logging, application performance monitoring, availability monitoring, alerting, and other services to improve the observability of our systems. We’ve migrated all our logs there, not only the ones produced by the backend but also the frontend and mobile ones, and we plugged it into the unified incident workflow. To handle incidents well, we’ve established several SRE (site reliability engineering) practices, such as incident runbooks.
We open-source our fiscalization libraries, and throughout the year, we’ve invested in making them more maintainable and extensible in the future. For example, we unified them into a monorepo, extracted the core library and improved the build pipelines.
A clearly visible part is our design system. The main focus was on its adoption. We migrated both our booking engine and guest portal to fully adopt the design system at the beginning of the year, and we eliminated some custom components that were used by the applications. Later, we kicked off the transition of the biggest app for hotel employees and migrated mainly the lower-level components like inputs, alerts, dropdowns and forms. An outcome of the adoption is a general increase in quality, accessibility and user-friendliness thanks to replacing some ad hoc components with polished ones from the design system. We focused on such improvements in the last quarter and made the reservation timeline, data visualization colors, date pickers, search boxes, and other components better. Our company also underwent a rebrand, and that was when the design system really shone: we adjusted to a new visual identity in almost no time.
As our client applications are getting bigger and “thicker”, we focused more on their observability and performance. We’ve implemented black-box synthetic monitoring and started collecting metrics about memory utilization, long tasks (client CPU utilization), and client-side latency to get visibility into application performance. Thanks to that, we were able to rectify some of the issues and make our applications more performant and less resource-intensive. One example is customizable naming conventions in our APIs, which allow clients to choose between Pascal case and camel case and eliminate any transformations due to incompatible casing. Finally, we reduced our applications’ bundle sizes and decreased time-to-interactive by lazy-loading parts of the applications that are not immediately needed.
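The lazy-loading idea can be illustrated with a short TypeScript sketch. It is a generic example rather than the code used in the Mews applications: the `lazy` helper and the `loadReports` stand-in module are hypothetical. In a real bundle, the loader would typically be a dynamic `import()` call, which lets the bundler move that part of the application into a separate chunk.

```typescript
// Wraps a loader so the expensive part is fetched only on first use;
// later calls reuse the same promise instead of loading again.
function lazy<T>(loader: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      cached = loader();
    }
    return cached;
  };
}

// Stand-in for a rarely used part of the application. A real loader would
// look like () => import("./reports"), where the path is hypothetical.
let loadCount = 0;
const loadReports = lazy(async () => {
  loadCount += 1;
  return { render: (title: string) => `report: ${title}` };
});

// Nothing is loaded at startup; the module is requested when first needed.
async function openReportsScreen(): Promise<string> {
  const reports = await loadReports();
  return reports.render("occupancy");
}
```

Because the initial bundle no longer contains the deferred code, the browser has less to download and parse before the application becomes interactive.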
To follow the functional programming hype in Mews, we’ve introduced fp-ts into our tech stack and started adopting several functional programming principles. We finally got rid of our custom dependency injection container and started using React services. And we’ve rewritten our window/modal manager, which brings us ever closer to a proper single-page hotel application. The booking engine and guest portal are already there, BTW.
We have just four mobile engineers in Mews, so efficiency is imperative. In the first half of the year, we migrated our kiosk application to Flutter, enabling us to share more code with the other application for hotel employees. It also simplified our mobile tech stack, which is valuable on its own. Originally the kiosk was just for Android, so after the migration to Flutter, we successfully released a version of the kiosk for iOS. And since the kiosk is not a standard application, we’re now piloting an MDM (mobile device management) solution for iOS kiosks.
Having all applications in a single technology, Flutter, allowed us to develop a single version of our mobile design system and ensure our applications follow the branding and visual identity. On the lower level of code, we’ve migrated from a custom BLoC implementation to the bloc library (bloclibrary.dev). We also started using null safety in Dart and integrated an additional static code analyzer, dart_code_metrics.
For 2021, we had one big goal for data engineering: to choose and implement our ultimate data warehouse solution that would be used not only by data analysts in product teams but, in the longer term, by everyone in the company as the primary source of truth and the go-to place for data. We researched and prototyped three stacks and decided to go with Azure Databricks as the data lake and Looker as the business intelligence tool.
Throughout the year, we started ingesting data from various sources into Databricks, e.g. from the central system database, Application Insights logs, Heap, GitHub and YouTrack. That allowed us to migrate our product reporting away from Power BI on top of SQL, and it also enabled engineers and product managers to build reports and dashboards on their own, directly in Databricks. In addition, we’ve connected the DWH to our unified incident workflow to stay on top of issues. And as a pleasant side effect of Databricks, we are also speeding up the Mews Analytics pipeline and update rate.
Moving on to quality assurance, the focus in 2021 was on our end-to-end (E2E) testing framework and its adoption in the organization. As a first step, we migrated from C# tests written in Coypu to the much more approachable Robot Framework. That lowered the barrier to entry for all our QA engineers and enabled them to easily automate repetitive tasks. Thanks to that, coverage of our apps kept growing throughout the year. Most of our applications and screens are covered with E2E smoke tests, making our test suite a solid quality guarantee. In addition, we no longer run them asynchronously against the development environment from TeamCity. Instead, we made the E2E tests a mandatory precondition before each deployment within our Azure DevOps build pipelines.
Later in the year, we continued with the developer experience of E2E tests. We established team ownership on the level of each test. The tests now run against a dynamic environment created for each run, with data prepared for each E2E test. And to remove the dependency on the backend, we improved the E2E framework to detect the required test data and add it automatically, which allows the teams to create tests without the need to adjust the backend code.
IT and Security
Our ITS team supports all Mews employees from the IT perspective and oversees organizational security and audits. In the past year, we’ve successfully passed the ISO 9001, ISO 27001 and NF525 audits and, most importantly, PCI-DSS level 1 with increased scope, conducted by Verizon. The audits are a perfect opportunity to learn and improve if you approach them with an open mind and positive attitude, not as an annoyance. As direct or indirect follow-ups, we’ve deployed Microsoft Intune and Microsoft Defender to all devices within the company. We’ve upgraded the Microsoft licenses of all employees to the highest possible tier (E5), allowing us to apply all necessary security policies. We’ve consolidated most domains and their management at Namecheap and moved DNS management to Azure. And we improved our internal network perimeter.
To provide a counterpart to the ITS division regarding product security, we kicked off an independent product security team that will maintain our existing initiatives (e.g. continuous penetration testing by cobalt.io) and find and implement opportunities to improve our platform security. In addition, to better assess the risk of third-party vendors, the ITS team added Panorays to its toolbelt. And to keep our employees alert, we’ve run several security drills, exercises and training courses.
The most visible change from the employee experience perspective was the migration to a new knowledge base. After a thorough assessment, Guru came out on top, and after several months of using it, I can confirm it’s well deserved. We’ve also deployed Microsoft Autopilot for all new starters to simplify their onboarding. Finally, we automated email signatures and signature campaigns across the whole company.
We’ve participated in several large-scale RFPs with the most prominent hotel brands, and we’ve gone through various audits. In all cases, the platform and technical capabilities are the main things being judged and assessed there. The good news for us is that we are coming out well, passing the audits, ticking pretty much all the boxes in RFPs. However, we want to go above the mandatory requirements and “proactively” set the bar higher rather than only “reactively” fulfilling the minimum. The focus on technical features will only grow as we enter the enterprise segment. And for us to be the category leader, we need to be the ones who define the standards, who raise the bar for others, who blow the RFPs out of the water. The technical execution of our product should become one of the differentiators that we delight customers and beat the competition with.
Our main areas of focus will be the technical maturity of the platform and its evolution going forward. The highest priority will be platform security and reliability from the customer perspective. Those are non-negotiable for us; they impact all existing customers and matter for prospective customers, especially the bigger ones. In terms of evolution, we need to ensure we can scale together with the increasing number of both clients and developers in our teams. Modularization of our applications will be a topic we’ll definitely be addressing. We also want to support our engineers with the latest technologies to retain them and attract top talent since the quality of technology is one of the decision factors for developers.
Are you interested in participating in this journey? Let us know and join us. Would you like to learn more about any of the topics? Let us know as well. We want to openly share our experiences, and we have experts on all the topics mentioned above who would be happy to speak to you or at an event.