Availability Is More Than Just Uptime - Part 2

Fault Tolerance in Modern Systems

Long story short:

In the previous part, I wrote about how application availability is much more than just uptime. We looked at the SLA, SLO, and SLI frameworks, and how they help us measure and objectively interpret the reliability and availability of our systems.

But what can we do if outages are increasing, numbers are getting worse, and we can’t hit our targets? How do we defend against cascading failures that can take down our entire system?

In this part, we’ll explore two fundamental building blocks without which scalable systems remain theory only: Service Discovery & Dynamic Routing and Cascading Failure Protection & the Circuit Breaker Pattern.


1. Service Discovery and Dynamic Routing

Why does it matter?
In a distributed system, the key question is not “how many services are running on how many instances,” but how they actually find each other.

If you rely on static IPs and ports, your system will be about as stable as a phonebook in a fast-growing startup: obsolete within minutes.

As microservices scale up and down, new instances appear while others shut down. Hardcoding instance IPs quickly devolves into chaos. This is where service discovery comes into play: services dynamically register themselves in a central registry and query the current addresses of their peers.

Popular solutions include HashiCorp Consul, Netflix Eureka, or etcd. For load balancers, this is critical: without a dynamic list of backends, they’d be blindly routing traffic to stale addresses.

How does service discovery work?

  • Service Registry: A central database (e.g., Consul, Eureka, etcd) where all instances register themselves.
  • DNS-based discovery: In Kubernetes, CoreDNS and kube-proxy ensure service names always resolve to the right pod IPs.
  • Sidecar pattern: Service meshes (e.g., Istio, Linkerd) attach a transparent proxy to each pod, handling discovery and routing logic.
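In Kubernetes, DNS-based discovery falls out of an ordinary Service object: clients address the stable service name, and kube-proxy keeps routing to whichever pods currently match the selector. A minimal sketch (all names and ports are illustrative):

```yaml
# Pods matching the selector are registered as endpoints automatically.
apiVersion: v1
kind: Service
metadata:
  name: orders          # resolvable as orders.<namespace>.svc.cluster.local
spec:
  selector:
    app: orders         # discovery: any pod with this label becomes a backend
  ports:
    - port: 80          # stable port clients call
      targetPort: 8080  # port the container actually listens on
```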

 

Health checks: More than “up or down”
Not all pods are equal:

  • Starting, but not ready yet
  • Responding, but internally failing
  • Permanently down

That’s why Kubernetes uses readiness and liveness probes:

  • Liveness: ensures the pod is still functioning (e.g., restarts on deadlock).
  • Readiness: ensures the pod is ready to serve traffic.
  • Startup probes: useful for containers that take a long time to initialize.
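Declared on a container, the three probe types might look like this (paths, ports, and timings are illustrative):

```yaml
# Illustrative probe configuration for a single container.
containers:
  - name: app
    image: example/app:1.0        # hypothetical image
    startupProbe:
      httpGet: { path: /healthz, port: 8080 }
      failureThreshold: 30        # allow up to 30 x 10s for slow initialization
      periodSeconds: 10
    livenessProbe:
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 10           # restart the container if this keeps failing
    readinessProbe:
      httpGet: { path: /ready, port: 8080 }
      periodSeconds: 5            # removed from Service endpoints while failing
```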

This guarantees that only healthy pods receive traffic.
And don’t forget: graceful shutdowns and rolling updates are part of any resilient system.
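A graceful shutdown can be sketched with a preStop hook that gives load balancers time to drop the pod before the process receives SIGTERM (the values are illustrative):

```yaml
# Pod spec fragment: drain in-flight requests before the container stops.
terminationGracePeriodSeconds: 30   # hard kill only after 30 seconds
containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]   # let endpoint updates propagate first
```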

2. Cascading Failure Protection and Circuit Breaker Pattern

The domino effect
In microservice architectures, the biggest threat isn’t one service going down; it’s one slow service taking down everything else.

Imagine this:

  • Service A calls Service B, which responds slowly.
  • A’s threads are exhausted, and it stalls.
  • Service C, which depends on A, also stalls.
  • Within minutes, the entire system collapses, all because of a single misbehaving component.

How to prevent it?

1. Circuit Breaker Pattern

  • If a downstream service fails repeatedly, the circuit “opens” and further requests are rejected or served with a fallback response. This prevents resource exhaustion.

 

Example in C# with Polly:

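A minimal sketch using Polly’s circuit breaker API (the endpoint, thresholds, and fallback are illustrative, not a production recipe):

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;

// Open the circuit after 3 consecutive failures; stay open for 30 seconds.
var breaker = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 3,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (ex, delay) => Console.WriteLine($"Circuit opened for {delay}"),
        onReset: () => Console.WriteLine("Circuit closed again"));

using var client = new HttpClient { Timeout = TimeSpan.FromSeconds(2) };

try
{
    // Hypothetical downstream endpoint.
    var response = await breaker.ExecuteAsync(
        () => client.GetAsync("https://service-b.internal/api/orders"));
}
catch (BrokenCircuitException)
{
    // Circuit is open: fail fast and serve a fallback instead of waiting.
    Console.WriteLine("Service B unavailable - returning cached fallback.");
}
```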

Example in PowerShell with PSPolly:

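PSPolly (a PowerShell wrapper around Polly) expresses the same idea. The sketch below follows the module’s cmdlet names as I recall them, so verify them against the PSPolly documentation before relying on it:

```powershell
# Hedged sketch: cmdlet names and parameters should be checked against PSPolly's docs.
Import-Module PSPolly

# Open the circuit after 2 consecutive failures, stay open for 10 seconds.
$policy = New-PSPollyPolicy -CircuitBreaker `
    -ExceptionsAllowedBeforeBreaking 2 `
    -DurationOfBreak (New-TimeSpan -Seconds 10)

Invoke-PSPollyCommand -Policy $policy -ScriptBlock {
    # Hypothetical downstream endpoint.
    Invoke-RestMethod -Uri "https://service-b.internal/api/orders" -TimeoutSec 2
}
```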

2. Bulkhead Isolation

  • Services should not share thread pools or DB connections. If one fails, it shouldn’t drag others down.

3. Retry Policies

  • Avoid infinite retries. Use exponential backoff with limits (e.g., 1s → 2s → 4s → up to 30s).
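With Polly, such a capped exponential backoff can be sketched like this (the retry count and cap are illustrative):

```csharp
using System;
using System.Net.Http;
using Polly;

// Waits 1s, 2s, 4s, 8s, 16s - each delay capped at 30s, at most 5 attempts.
var retry = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(
        retryCount: 5,
        sleepDurationProvider: attempt =>
            TimeSpan.FromSeconds(Math.Min(Math.Pow(2, attempt - 1), 30)));
```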

4. Proper Timeouts

  • If a dependency never responds in under 2 seconds, don’t wait 30.
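Polly can enforce this as an explicit timeout policy instead of relying on client defaults; a sketch with an illustrative 2-second budget:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using Polly;
using Polly.Timeout;

// Give up after 2 seconds rather than the HttpClient default of 100 seconds.
var timeout = Policy.TimeoutAsync(TimeSpan.FromSeconds(2));
using var client = new HttpClient();

try
{
    // Hypothetical downstream endpoint; the policy cancels the call when time is up.
    var response = await timeout.ExecuteAsync(
        ct => client.GetAsync("https://service-b.internal/api/orders", ct),
        CancellationToken.None);
}
catch (TimeoutRejectedException)
{
    // Dependency was too slow: fail fast and fall back.
    Console.WriteLine("Timed out - serving degraded response.");
}
```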

Tools and frameworks:

  • Istio/Envoy (built-in resilience features)
  • Polly (C#) / Resilience4j (Java)
  • Netflix Hystrix (the classic, but in maintenance mode since 2018)

Verdict

The technologies discussed in this part greatly contribute to building a stable system and ensuring that, as SREs or DevOps engineers, we can sleep peacefully at night.
Service discovery and probes help ensure that when one of our pods fails (for example, due to an OOM kill), a new, fully functional instance is created quickly and traffic is routed to it correctly.
Meanwhile, the circuit breaker and cascading failure protection mechanisms keep the entire infrastructure from collapsing because of a single weak link. Fault tolerance is not an optional convenience but a fundamental requirement of modern architectures. The earlier we build it into our systems, the lower the price we’ll pay later.

What about you?
How have you implemented fault tolerance in your systems? Do you use a service mesh, or stick to simpler libraries like Polly or Resilience4j? I’d love to hear your experiences: share them in the comments!
