Resolved: MobilePay transactions are processing again.
On Friday, November 17th, Quickpay encountered significant operational challenges, affecting the delivery of callbacks and the processing of MobilePay Online and Vipps payments.
Root Cause Analysis
Background: About three months prior, Quickpay transitioned from an on-premise queue system to a cloud-based alternative. This change was not due to performance issues with the on-premise queue but was required for planned future updates to our platform.
Incident Details: At approximately 11:20 CET on Friday, Quickpay began a routine import of payments from another PSP into our system, a process typically conducted 3-4 times weekly. However, each imported payment inadvertently triggered a callback to merchants, overwhelming their systems and leading to the rejection of these callbacks.
This surge in rejected callbacks accumulated in our system, eventually reaching a threshold we had not previously encountered. This bottleneck also impacted MobilePay Online and Vipps transactions, which utilize the same queue system. The MobilePay Online and Vipps events are prioritized above callbacks to ensure new transactions are processed even if there is a delay in callbacks. However, as the platform reached the threshold, almost no events were processed.
Regrettably, our team was unable to manually intervene in the queue without deleting all items, which would result in irreversible data loss.
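The prioritization described above, where MobilePay Online and Vipps events rank ahead of callbacks on a shared queue, can be illustrated with a small sketch. This is a toy model only, not Quickpay's queue system: the `SharedQueue` class, the priority constants, and the item names are all hypothetical, and the sketch does not model the threshold at which processing stalled during the incident.

```python
import heapq
import itertools

# Hypothetical priority levels: a lower number is served first.
PRIORITY_PAYMENT_EVENT = 0  # e.g. MobilePay Online / Vipps events
PRIORITY_CALLBACK = 1       # merchant callbacks


class SharedQueue:
    """Toy priority queue for illustration; not Quickpay's actual system."""

    def __init__(self):
        self._heap = []
        # Monotonic counter keeps FIFO order among items of equal priority.
        self._counter = itertools.count()

    def put(self, priority, item):
        heapq.heappush(self._heap, (priority, next(self._counter), item))

    def get(self):
        _priority, _seq, item = heapq.heappop(self._heap)
        return item

    def __len__(self):
        return len(self._heap)


queue = SharedQueue()

# A backlog of rejected callbacks waiting to be retried.
for i in range(3):
    queue.put(PRIORITY_CALLBACK, f"callback-{i}")

# A payment event arrives after the callbacks were queued.
queue.put(PRIORITY_PAYMENT_EVENT, "payment-event-0")

first = queue.get()
print(first)  # payment-event-0
```

Under normal conditions this ordering lets new payment events jump ahead of a callback backlog. In the incident, the backlog reached a threshold at which almost no events were processed, regardless of priority.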
Resolution: By 13:45, we had reinstated our on-premise queue system, after which MobilePay Online, Vipps, and callbacks resumed normal processing. By 10:00 on Saturday, a portion of the failed events had been processed from the queue, which enabled us to process all remaining items, including pending MobilePay and Vipps payments.
Immediate Actions: We reverted to our on-premise queue system last Friday. This system, used up to six months ago, has been rigorously tested over several years, including during peak periods like previous Black Fridays, without issues related to the queue.
The incident was not due to overall platform load. We are confident in the system's stability in the upcoming months and, with the on-premise queue, we are better equipped to swiftly address similar issues if they arise.
Long-Term Strategy: We will eventually transition back to the cloud-based queue system, but not before next year. We are already sketching out enhancements that will enable us to process significantly more rejected callbacks than we could during the incident. We will also develop tools for more effective manual intervention.
We acknowledge our failure to meet our communication standards. Despite promptly identifying the problem, internal miscommunication led to delayed updates to our customer support team and on our status page (https://status.quickpay.net). Post-resolution, our communication regarding the processing of queued events was also insufficient.
As part of a larger corporate group, we are exploring ways to leverage group resources to enhance our external communication capabilities.
We recognize the significant inconvenience caused to our merchants by this incident. We sincerely apologize for these disruptions. While we cannot guarantee the absence of future issues, we are committed to continually improving our platform and communication strategies.
We are again processing MobilePay transactions.
VippsPSP transactions were also affected during the same time frame.
We are still investigating the pending MobilePay transactions and delayed callbacks, and we are continuing to monitor the situation.