Back in 2019, we introduced Mews Terminals. Right from the start, this proved to be a popular addition to the Mews product offering. Between 2020 and 2021, the number of terminal devices deployed to hotels around the globe increased by 718%, and the volume of processed payments increased by 4085%. With such an increase in users, we started focusing more on the performance and speed of terminals. The main goal was to decrease the waiting time between the moment the terminal shows that the payment was successful and the moment the payment is charged in Mews. After a series of improvements, we decreased the waiting time for the slowest 10% of payments by over 90%, nearly eliminated payments taking more than four minutes, and decreased the overall waiting time by 70%.
Synchronous or asynchronous terminal solutions
To understand how we achieved this performance increase, we need to provide some background information on how the Mews terminals work.
The terminals are provided by one of our payment service providers. We can communicate with the devices through a cloud API, which means that we’re able to control the devices from anywhere in the world.
There are two ways in which you can connect to the terminal through this API. The most straightforward way is to connect synchronously, which means you get the result of the payment right away. But it also means that if the hotel guests take two minutes to find their card, put in their PIN, and complete the payment, your UI is blocked for two minutes. On top of that, there can be added issues with unstable internet connections.
The Mews terminals, however, use an asynchronous flow. In this flow, we send a request to the cloud API and later receive a series of webhooks indicating the payment state. This approach is more scalable, as you don’t have to worry about open network connections and the webhooks can be processed by background jobs.
When creating a terminal payment in the asynchronous flow, we go through the following steps:
- A payment request is sent to the terminal device.
- The customer completes the payment with their card.
- The payment service provider sends us a webhook, indicating that the payment was successful on the terminal (this happens almost instantly after the customer completes their payment).
- The payment service provider sends us a second webhook, indicating that the payment has been authorized by the customer’s bank.
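The steps above can be sketched as a small state machine. This is only an illustration of the original flow, in which a payment counts as charged only after the second webhook; the event names `success` and `authorized`, the `Payment` class, and the state names are assumptions made for the example, not the payment service provider's actual API.

```python
from dataclasses import dataclass
from enum import Enum


class PaymentState(Enum):
    PENDING = "pending"      # request sent to the terminal device
    SUCCEEDED = "succeeded"  # first webhook: terminal reported success
    CHARGED = "charged"      # second webhook: bank authorized the payment


@dataclass
class Payment:
    payment_id: str
    state: PaymentState = PaymentState.PENDING


def handle_webhook(payment: Payment, event: str) -> Payment:
    """Advance the payment based on a (hypothetical) webhook event name."""
    if event == "success" and payment.state is PaymentState.PENDING:
        # The terminal says the payment went through, but we keep waiting.
        payment.state = PaymentState.SUCCEEDED
    elif event == "authorized":
        # Only now is the payment considered charged.
        payment.state = PaymentState.CHARGED
    return payment
```

Because each webhook is handled independently, a background job can process it without keeping a network connection open to the terminal.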
Going through the data
At Mews, we love working with data to get better insights into our application. So, at the beginning of the investigation, we started by looking at raw SQL data.
Initially, we looked at the average time between creating and completing a payment, and everything looked fine. However, we still had reports from our customers saying that the terminals were slow. So, we decided to interpret the data differently and ended up using percentiles instead. When we started looking into these numbers, it became clear that 95% of payments were charged within 30 seconds, but the slowest 5% could take up to 20 minutes.
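To see why an average can hide a slow tail while percentiles expose it, here is a minimal sketch using Python's standard `statistics` module. The durations are made up for illustration and are not our production data.

```python
import statistics

# Made-up durations (in seconds) between creating and charging a payment:
# 95 fast payments and 5 very slow ones.
durations = [10.0] * 95 + [1200.0] * 5

# The mean blends the fast majority and the slow tail into one number.
mean = statistics.mean(durations)

# quantiles(n=100) returns the 99 cut points between the percentiles.
cuts = statistics.quantiles(durations, n=100, method="inclusive")
p50 = cuts[49]  # median: what a typical payment looks like
p99 = cuts[98]  # 99th percentile: what the slowest payments look like

print(f"mean={mean:.1f}s p50={p50:.1f}s p99={p99:.1f}s")
```

In this toy data set the mean sits far from both the typical payment and the slow ones, whereas the median and the 99th percentile describe each group directly.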
Coming up with a solution
Based on the results of our initial investigation, we threw some ideas around. One of them was to set a timeout, so that instead of waiting for 20 minutes we would mark the payment as incomplete after a set amount of time. This idea was quickly discarded because the UX would be bad: the terminal would show a successful payment, but minutes later it would be marked as failed in Mews.
We realized that we needed to look at the data again and find a different solution. Earlier in this article, I described the asynchronous flow with webhooks, in which we mark a payment as completed based on the webhooks we receive. So, the webhooks were the next logical place to investigate. After digging through the data once more, we realized a few things. The first was that the ‘authorized’ webhook arrives considerably later than the initial ‘success’ webhook, in some cases up to 20 minutes later. This indicated an area for improvement, as we only considered a payment as charged in Mews after receiving the ‘authorized’ webhook.
The aha moment
We realized that all the information needed to consider a payment as completed was already in the first ‘success’ webhook. Maybe we didn’t have to wait for the second ‘authorized’ webhook, after all.
Not waiting for the second webhook would mean a huge performance improvement because the first notification arrives almost instantly after the customer completes the payment on the terminal.
However, this was a potentially dangerous change. We wouldn’t want payments to be marked as charged in Mews when in reality they are not completed yet. With this in mind, we continued going through the data and came to another realization: there wasn’t a single payment where we had received a ‘success’ webhook without later receiving the ‘authorized’ webhook. So, based on the historical data, there didn’t seem to be any huge risks.
It seemed like we had been too pessimistic. We only considered the payment as charged after receiving the second webhook that came noticeably later. Instead, we could have marked the payment as charged after receiving the first webhook, which we receive almost instantly after the customer finishes their payment.
After this discovery, we made a relatively small code change to optimistically mark the payment as charged after the first webhook. Throughout the process, we took a fail-fast approach so that we would see issues early and could respond quickly if needed.
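A minimal sketch of what such an optimistic change could look like is shown below. The `Payment` class, the event names, and the `optimistic` switch are hypothetical and only illustrate the idea; they are not the actual Mews code.

```python
from dataclasses import dataclass


@dataclass
class Payment:
    payment_id: str
    charged: bool = False     # what the user sees in the application
    authorized: bool = False  # whether the bank confirmation has arrived


def handle_webhook(payment: Payment, event: str, optimistic: bool) -> Payment:
    """Process one (hypothetical) webhook event for a terminal payment."""
    if event == "success" and optimistic:
        # Optimistic flow: treat the terminal's success as a charged payment.
        payment.charged = True
    elif event == "authorized":
        # Both flows: the bank confirmation always results in a charged payment.
        payment.authorized = True
        payment.charged = True
    return payment
```

Keeping `authorized` as a separate field means a payment that was charged optimistically but never confirmed remains detectable afterwards.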
However, the biggest challenge wasn’t the changes in code but the monitoring of the new flow. Before we pushed any changes to production, we created a dashboard in Databricks. On top of that, we extended our error reporting to report any inconsistent payments as soon as possible. After preparing the dashboard, the ‘feature’ was released under a feature flag, and we started monitoring the new behavior.
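The kind of check that could feed such error reporting is sketched below: it looks for payments that received the first webhook but no second one within a grace period. The `PaymentRecord` type, the `find_inconsistent` function, and the 30-minute grace period are assumptions for the example, not the actual monitoring setup.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class PaymentRecord:
    payment_id: str
    success_at: datetime                     # when the 'success' webhook arrived
    authorized_at: Optional[datetime] = None # when 'authorized' arrived, if ever


def find_inconsistent(
    records: List[PaymentRecord],
    now: datetime,
    grace: timedelta = timedelta(minutes=30),  # assumed threshold
) -> List[str]:
    """Return IDs of payments still missing 'authorized' after the grace period."""
    return [
        r.payment_id
        for r in records
        if r.authorized_at is None and now - r.success_at > grace
    ]
```

Running a check like this on a schedule and alerting on any non-empty result would surface a wrongly charged payment soon after it happens.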
Right away, we saw a clear improvement in payments using the new flow. In the end, we reduced the overall waiting time by 70% and cut the waiting time for the slowest 10% of payments by over 90%.
We got there simply by using data and being optimistic while staying prepared for worst-case scenarios.