When Intermittent Cloud Timeouts Turn Business-Critical in Financial Services
Some incidents announce themselves loudly. Others arrive as scattered symptoms that are much harder to pin down.
Recently, Pipe Ten supported a financial services partner through exactly this kind of challenge. Intermittent connection timeouts began appearing between cloud-hosted workloads and services across its wider hosted estate. The failures were sporadic, difficult to reproduce, and not confined to one application or one traffic path. Some requests completed normally and others timed out. The symptoms appeared across different services, at different times, and on both inbound and outbound flows.
This is the kind of problem that quickly becomes business-critical. There is no single obvious fault domain. The cause could sit in the application, the firewalls, the load balancers, upstream routing, third-party dependencies, or a combination of several factors. To complicate matters further, the issue surfaced shortly after a scheduled firewall upgrade, making the recent change the most obvious early suspect.
Pipe Ten did not jump to conclusions. We led the incident as an evidence-driven investigation, working to narrow the field, protect service, and give the partner a clear technical path forward.

Start with the obvious, but don’t stop there
We began with a simple question: was there a clear fault inside the hosted estate?
To answer that, we reviewed service and system logs, checked firewall state and failover behaviour, analysed load balancer activity, and compared partner-reported symptoms against our own monitoring. Early on, nothing pointed to an obvious fault in the local platform. We could see the impact being felt by the partner, but not a clear signature of failure within the application stack or core infrastructure.
That was useful in itself. It told us not to waste time defending assumptions or chasing a convenient theory. It told us we needed broader visibility, so we expanded the investigation.
Build a clearer picture
To understand whether the problem was tied to a specific route, location, or direction of travel, we set up repeated network and application-level tests from multiple angles. That included continuous TCP and HTTPS checks, traceroutes, MTR runs, and comparisons between traffic entering the platform from the cloud and traffic leaving it towards cloud-based endpoints. We were careful to correlate observations across different measurement tools, as each tells a slightly different story about what the network is doing.
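To give a flavour of what those repeated checks look like in practice, here is a minimal sketch of a timestamped TCP and HTTPS probe. The hostname, health-check path, and intervals below are hypothetical placeholders for illustration, not the partner's actual endpoints or the exact tooling used during the incident.

import socket
import time
import datetime
import urllib.request

TARGET_HOST = "app.example.internal"            # hypothetical endpoint, not the real one
TARGET_PORT = 443
CHECK_URL = f"https://{TARGET_HOST}/healthz"    # hypothetical health-check path
INTERVAL_SECONDS = 5
TIMEOUT_SECONDS = 3

def tcp_check():
    """Attempt a plain TCP connection and report outcome plus elapsed time."""
    start = time.monotonic()
    try:
        with socket.create_connection((TARGET_HOST, TARGET_PORT), timeout=TIMEOUT_SECONDS):
            return "ok", time.monotonic() - start
    except OSError as exc:
        return f"fail ({exc})", time.monotonic() - start

def https_check():
    """Attempt a full TLS handshake plus HTTP GET and report the outcome."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(CHECK_URL, timeout=TIMEOUT_SECONDS) as resp:
            return f"http {resp.status}", time.monotonic() - start
    except Exception as exc:
        return f"fail ({exc})", time.monotonic() - start

while True:
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    tcp_result, tcp_elapsed = tcp_check()
    https_result, https_elapsed = https_check()
    # One line per interval, so results can later be lined up against
    # partner-reported failure timestamps and traceroute/MTR output.
    print(f"{now} tcp={tcp_result} ({tcp_elapsed:.2f}s) "
          f"https={https_result} ({https_elapsed:.2f}s)")
    time.sleep(INTERVAL_SECONDS)

Running several copies of a probe like this from different vantage points, inbound and outbound, is what makes later correlation possible: every success and failure carries a timestamp and a direction of travel.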
We also worked closely with our partner to gather precise failure timestamps. That proved essential. Intermittent incidents often look random until multiple datasets are placed side by side. Once we correlated their timestamps with our own test data, a pattern began to emerge.
The strongest evidence suggested that the timeouts aligned with instability on intermediary network paths between the cloud provider and our upstream connectivity. The cloud provider’s ECMP (Equal-Cost Multi-Path) load balancing was distributing traffic across multiple upstream paths, and some of those paths were showing signs of instability. We observed route changes and path variation that looked inconsistent with stable handling of long-lived, business-critical traffic. Internet routing can and does change naturally, but the timing and frequency here gave us reason to believe upstream path instability was contributing to the failures.
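To illustrate the correlation step itself, here is a simplified sketch of how partner-reported failure timestamps can be matched against windows in which our probes saw the upstream path change. The timestamps and tolerance below are invented for illustration; they are not the real incident data.

from datetime import datetime, timedelta

# Timestamps reported by the partner for individual request timeouts (UTC).
# Illustrative values only.
reported_failures = [
    datetime(2024, 1, 10, 9, 14, 2),
    datetime(2024, 1, 10, 9, 41, 55),
    datetime(2024, 1, 10, 11, 3, 20),
]

# Windows in which our own probes saw the upstream path change, e.g. a
# different hop sequence between successive traceroute/MTR runs.
path_change_windows = [
    (datetime(2024, 1, 10, 9, 13, 30), datetime(2024, 1, 10, 9, 15, 0)),
    (datetime(2024, 1, 10, 10, 2, 0), datetime(2024, 1, 10, 10, 3, 0)),
    (datetime(2024, 1, 10, 11, 2, 45), datetime(2024, 1, 10, 11, 4, 10)),
]

TOLERANCE = timedelta(seconds=30)  # allow for clock skew between datasets

def overlaps(failure, window):
    """True if a reported failure falls inside a path-change window, +/- tolerance."""
    start, end = window
    return start - TOLERANCE <= failure <= end + TOLERANCE

matched = [
    f for f in reported_failures
    if any(overlaps(f, w) for w in path_change_windows)
]

print(f"{len(matched)} of {len(reported_failures)} reported failures "
      f"fall inside observed path-change windows")

A result showing most failures landing inside path-change windows, rather than scattered evenly across the day, is exactly the kind of evidence that turns an intermittent complaint into an escalable case.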
Validating the environment
Even when the evidence starts to point upstream, it is still important to challenge your own assumptions first.
Because the incident appeared shortly after firewall work, we deliberately tested that theory rather than dismissing it. We failed traffic across to the alternate firewall, reviewed state and logging in detail, ran further checks for anomalies, and went further by downgrading one firewall and rebuilding the other.
Those steps did not materially change the failure pattern, which was an important outcome in itself: it allowed us to narrow the scope of the investigation with confidence. By that stage, we could show that the issue persisted independently of the recent changes.
Escalating with evidence, not opinion
Intermittent network issues are notoriously difficult to escalate. Without strong evidence, they are easy for upstream providers to treat as transient or inconclusive.
That is why correlation mattered so much. By combining user-reported failures with our own route analysis and repeated test results, we were able to build a much stronger technical case for escalation. Instead of saying something felt wrong, we could identify specific time windows where application timeouts and network path instability appeared to overlap.
That changed the conversation. The discussion moved from opinion to evidence, and from guesswork to investigation.
Moving from diagnosis to mitigation
Strong incident response is not just about finding the most likely cause. It is about reducing risk while the wider investigation continues.
Once it became clear that the public path between the cloud provider and the hosted services was a likely contributor, Pipe Ten moved from diagnosis to mitigation. We designed and deployed a practical workaround using the partner’s private connectivity, creating an alternative route for critical traffic that did not rely on the same public internet path.
This involved creating a dedicated load balancer on a network that was reachable over Direct Connect, validating the path end to end, updating firewall rules, and working closely with both the partner and the upstream provider to resolve route advertisement and propagation issues. There were a few twists along the way, but once the change was in place it did exactly what it was intended to do. It gave the customer a safer, more stable path for critical traffic while the wider issue was still being worked through.
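For illustration only, this is the kind of end-to-end validation that can be run against a new private-path endpoint before directing critical traffic at it. The DNS name and health-check path below are hypothetical placeholders, not the actual load balancer built during this incident.

import socket
import ssl
import urllib.request

PRIVATE_ENDPOINT = "critical-app.private.example"   # hypothetical internal LB name
PORT = 443
HEALTH_URL = f"https://{PRIVATE_ENDPOINT}/healthz"   # hypothetical health-check path

# 1. DNS resolution: confirm the name resolves to addresses on the private
#    range reachable over the dedicated connection, not public space.
addresses = {info[4][0] for info in socket.getaddrinfo(PRIVATE_ENDPOINT, PORT)}
print("resolved addresses:", addresses)

# 2. TCP + TLS: confirm a full handshake completes across the private path.
context = ssl.create_default_context()
with socket.create_connection((PRIVATE_ENDPOINT, PORT), timeout=5) as sock:
    with context.wrap_socket(sock, server_hostname=PRIVATE_ENDPOINT) as tls:
        print("negotiated TLS version:", tls.version())

# 3. Application layer: confirm the service behind the load balancer answers.
with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
    print("health check status:", resp.status)

Only once checks like these pass at every layer, and the firewall rules and route advertisements are confirmed, does it make sense to move live traffic across.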
In simple terms, we did not wait for the wider internet to settle down. We engineered a safer route and kept the traffic moving.
What this incident reinforced
This incident reinforced what a managed service should look like in practice.
It means responding in minutes, not days, when critical services fail.
It means not assuming the last change is automatically guilty.
It means being willing to challenge your own platform first, even when that is inconvenient.
It means collecting enough evidence to make escalation credible.
It means building practical mitigations while the root cause is still being investigated.
And most importantly, it means treating your partner’s problem as your own, even when the final resolution may sit outside your estate.
The outcome
Within minutes of the partner reporting the issue, Pipe Ten’s team was actively investigating. Over the following three days, while the cloud provider’s support processes worked through their standard escalation procedures, our team worked around the clock – investigating, testing, building evidence, and ultimately engineering a live workaround that protected critical services.
This is the difference a managed service partner makes. When a cloud provider’s support operates on ticketing timelines, Pipe Ten operates on incident timelines.
As the case developed, we put a temporary mitigation in place to protect service while the suspected upstream issue was pursued. That bought valuable time, reduced operational risk, and allowed critical financial workloads to continue functioning while the investigation remained active.
The cloud provider ultimately confirmed action was taken to stabilise the routing path. Once the path was stable and our testing confirmed the issue had cleared, we were able to remove the mitigation and safely reinstate standard traffic flows.
That was the outcome that mattered. Pipe Ten did not simply diagnose a difficult issue. We coordinated the response across multiple parties, built the evidence that moved the escalation forward, engineered a live workaround to protect service, and stayed hands-on until normal operations were fully restored.
Whether your workloads run in our data centres, a public cloud, or across both, a great managed service partner makes the difference between an incident that defines your week and one you barely notice. For Pipe Ten, being that partner means taking ownership when things are unclear, responding immediately when seconds count, and standing shoulder to shoulder with you until service is fully restored.
