Your Production Health Checks Are Lying to You

The dashboard was green. The alerts were quiet. Every service showed as up. And for forty minutes, the checkout page was silently broken. Most health checks confirm the process is alive. They say nothing about whether it can actually handle requests.

This is part of The True Code of Production Systems. The series is about the decisions that only become visible when something breaks in production.


The dashboard was green.

The on-call engineer had checked it twenty minutes earlier, seen nothing alarming, and gone back to what he was doing. The alerts were quiet. The monitoring panel showed every service up. Response times looked normal.

Meanwhile, users on the checkout page were hitting timeouts on every order attempt. Not errors. Timeouts. The kind that make users refresh, try again, give up, and leave. For forty minutes, the platform's highest-converting page was effectively broken.

When the team finally traced it, the cause was not a crashed service, not a bad deployment, not a network issue. The service was running. The process was healthy. The production health check endpoint was returning 200 OK on every single probe.

But the database connection pool had been exhausted for forty minutes. Every request that reached the service was waiting for a connection that was never coming. The health check had no visibility into this. It was checking whether the process was alive. It had no opinion on whether the process could actually do anything useful.

That is the core problem with most production health checks. They do not lie aggressively. They lie by omission. They confirm that the lights are on while saying nothing about whether anyone is home. And when the incident finally surfaces, the team is left debugging a production-only problem with no signal from the very system that was supposed to provide one.


What a Health Check Is Actually Supposed to Tell You

A health check exists to answer one question for the infrastructure routing traffic to your service: should I send requests here right now?

That sounds simple. In practice, most teams build health checks that answer a different question entirely: is this process running? Those are not the same question, and treating them as equivalent is what causes forty-minute incidents to go undetected while the dashboard stays green.

The infrastructure consuming your health check, whether that is a Kubernetes cluster, a load balancer, an Azure App Service, or an AWS target group, uses the response to make real decisions. A healthy response means traffic continues. An unhealthy response means the instance gets pulled from rotation, restarted, or flagged. The health check is not instrumentation. It is a control signal. When it gives the wrong signal, the infrastructure makes the wrong decision, and the consequences reach users.

Most health check implementations never think about this. They add an endpoint, return a status code, and consider the requirement satisfied. The endpoint does its job of responding. The job it was supposed to do, accurately representing whether the service can handle requests, never gets done.

Ask yourself: if your most critical dependency failed right now, would your health check response change? If the answer is no, your health check is not measuring health.

The Process Is Alive. The Service Is Not.

The most common form of health check failure is also the subtlest. The check confirms liveness (the process is running and can accept a TCP connection) while staying completely silent about readiness (whether the service is actually prepared to serve production traffic).

A process can be alive and useless at the same time. It might be running but still loading configuration. It might be running but unable to acquire a database connection. It might be running but sitting behind a saturated thread pool where every incoming request queues indefinitely. In every one of these scenarios, a liveness-only health check returns healthy. Traffic keeps arriving. Users keep waiting. Nothing in your monitoring indicates a problem until the situation has been festering long enough for a human to notice.

This is not a theoretical edge case. It is one of the most common patterns in production incidents involving microservices and containerised workloads. The service is technically up. The health check says so. The logs say so. The process table says so. The users experience something entirely different. This is part of a broader problem: most developers ship for the happy path and discover production's real demands later.

The distinction that matters here is between liveness and readiness. Liveness answers the question of whether the process should be restarted. Readiness answers the question of whether it should receive traffic. Kubernetes makes this distinction explicit with separate probe types, and teams running on Kubernetes often get this right because the framework forces the conversation. Teams running on load balancers, managed services, or custom orchestration skip this distinction far more often, and they pay for it in exactly the kind of incident described above.

Two endpoints. Two different questions. Two different consumers with two different needs. This is not over-engineering. It is the minimum required to make the control signal accurate.
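As a minimal sketch of that split, the two questions can be written as two separate functions. Everything named here (`liveness`, `readiness`, the check names) is hypothetical and framework-agnostic; a real service would wire these into whatever routes its web framework and orchestrator expect.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type: a readiness check is any zero-argument callable returning bool.
Check = Callable[[], bool]


def liveness() -> Tuple[int, dict]:
    """Answers: should this process be restarted? It only proves the process can respond."""
    return 200, {"status": "alive"}


def readiness(checks: Dict[str, Check]) -> Tuple[int, dict]:
    """Answers: should this instance receive traffic right now?"""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed check, not a crashed endpoint.
            results[name] = False
    ready = all(results.values())
    status = "ready" if ready else "not_ready"
    return (200 if ready else 503), {"status": status, "checks": results}


# Illustrative usage: the process is alive, but one dependency check fails.
example_checks = {
    "config_loaded": lambda: True,
    "db_pool_has_capacity": lambda: False,
}
print(liveness())
print(readiness(example_checks))
```

Keeping liveness free of dependency checks is a deliberate choice in this sketch: a database outage should take the instance out of rotation, not trigger a restart loop.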

Ask yourself: does your current health check distinguish between "the process is alive" and "the process is ready to serve requests"? If it does not, which incidents in your history might have gone differently if it had?

Checking Connectivity Is Not the Same as Checking Capability

The next failure mode is more deceptive because it looks like due diligence. The health check does not just ping the process; it also checks dependencies. It connects to the database. It hits the cache. It verifies the message queue is reachable. This looks thorough. It is still missing the point.

Reachability is not capability. A database that responds to a ping in two milliseconds is not the same as a database whose connection pool has capacity to serve another request. A cache that accepts a PING command is not the same as a cache that has enough memory headroom to stop evicting the data you need. A message queue that responds to a health probe is not the same as a queue whose consumers are keeping up; consumer lag may have grown to the point where new messages will not be processed for twenty minutes.

The health check that tests reachability and declares the dependency healthy has produced a health check false positive. The surface test passed. The underlying capability is degraded. Traffic continues flowing to a service that cannot actually process it effectively.

What the health check needs to measure is not whether it can reach the dependency but whether the dependency has the capacity to serve another request. For a database, that means checking whether the connection pool has available connections, not just whether the database server responds. For an external API, that means checking recent response times against an acceptable threshold, not just whether the endpoint returns a non-error status. For a queue, that means checking consumer lag if that lag affects your service's ability to do its work.

This requires knowing your failure modes before you write the check. For each dependency your service relies on, there is a specific condition under which that dependency is reachable but not usable. Your health check should be built around those specific conditions. A generic reachability ping tells you almost nothing about whether those conditions are currently present.
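For the connection pool case, a capability check might look like the sketch below. `ConnectionPool` is a toy stand-in built on a semaphore, not any real driver's pool, and the 50 ms timeout is an assumed value. Many real pools expose their own statistics or a borrow-with-timeout call, and those should be preferred where they exist.

```python
import threading


class ConnectionPool:
    """Toy stand-in for a real database pool (hypothetical, for illustration only)."""

    def __init__(self, size: int) -> None:
        self._slots = threading.BoundedSemaphore(size)

    def acquire(self, timeout: float) -> bool:
        # Returns False if no connection frees up within `timeout` seconds.
        return self._slots.acquire(timeout=timeout)

    def release(self) -> None:
        self._slots.release()


def pool_has_capacity(pool: ConnectionPool, timeout: float = 0.05) -> bool:
    """Capability check: can a connection actually be borrowed right now?"""
    if not pool.acquire(timeout=timeout):
        return False  # reachable database, exhausted pool: not ready for traffic
    pool.release()
    return True


# Illustrative usage: exhaust a two-connection pool, then probe it.
pool = ConnectionPool(size=2)
pool.acquire(timeout=0.01)
pool.acquire(timeout=0.01)
print(pool_has_capacity(pool, timeout=0.01))
```

The probe borrows a connection for a moment, so it competes with real requests. The short timeout keeps that cost bounded and makes an exhausted pool fail the probe quickly instead of hanging it.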

Ask yourself: for each dependency your service calls, what is the specific condition that makes it reachable but not useful? Is your health check capable of detecting that condition?

Stale Health Status Is Still a Lie

Caching is almost always a reasonable engineering choice. Health check responses are one of the exceptions.

When a health check caches its result and serves the cached response to the next probe, it is not reporting the current health of the service. It is reporting what the health was at some earlier moment. The gap between those two things is exactly the window during which your service can become unhealthy without the infrastructure knowing.

A database connection pool that exhausts itself in thirty seconds produces a cached health check that reports healthy for the duration of whatever TTL was set. A dependency that starts returning errors after your last health evaluation is invisible to the next probe. The infrastructure keeps routing traffic. The monitoring dashboard stays green. The staleness of the signal is indistinguishable from freshness.

Health check endpoints should never be cached. Every probe should trigger a live evaluation. The computational cost of this is small. Health checks should be lightweight by design: fast checks that hit the specific conditions that matter, not expensive operations that justify caching to reduce load. If your health check is expensive enough that you are tempted to cache it, the health check is doing too much of the wrong kind of work. For teams who want a second layer of production confidence on top of honest health checks, synthetic monitoring fills the gap that passive probes cannot cover.

The same principle applies to background health workers that update a shared flag. If the worker runs every fifteen seconds and sets an isHealthy flag that the health endpoint reads, the health endpoint is already serving a value that can be up to fifteen seconds old. If that background worker itself fails silently, the flag stops updating entirely, and the endpoint serves whatever the last known state was indefinitely. The health check continues to respond. The response is no longer meaningful.
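Live evaluation avoids this problem entirely. Where a background worker cannot be removed, one possible mitigation is to store a timestamp alongside the flag and treat an old value as unhealthy. The sketch below assumes that approach; `HealthSnapshot`, `evaluate`, and `max_age_s` are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthSnapshot:
    healthy: bool
    checked_at: float  # time.monotonic() value recorded when the worker last ran


def evaluate(snapshot: Optional[HealthSnapshot], max_age_s: float,
             now: Optional[float] = None) -> bool:
    """Treat a missing or stale snapshot as unhealthy instead of trusting it."""
    if snapshot is None:
        return False  # the worker has never reported
    now = time.monotonic() if now is None else now
    if now - snapshot.checked_at > max_age_s:
        return False  # the worker stopped updating; the flag no longer means anything
    return snapshot.healthy


# Illustrative usage with explicit clock values so the behaviour is easy to follow.
fresh = HealthSnapshot(healthy=True, checked_at=100.0)
print(evaluate(fresh, max_age_s=30.0, now=110.0))  # snapshot is 10 seconds old
print(evaluate(fresh, max_age_s=30.0, now=140.0))  # snapshot is 40 seconds old
```

This does not make the signal fresh. It only bounds how long a dead worker can keep the endpoint reporting the last known state.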

Ask yourself: is there any layer of caching, buffering, or asynchronous state between your health probe and the actual live evaluation of your service's ability to handle requests?

Misconfigured Timeouts Create Incidents of Their Own

The timeout on a health check probe is not just a configuration value. It determines how quickly the infrastructure can react when a service degrades, and it determines whether healthy services get incorrectly pulled from rotation during load spikes.

Set the timeout too tight and you generate health check false negatives during normal traffic spikes: healthy instances get reported as unhealthy. A service that is slightly slower than usual, but still healthy and serving requests successfully, starts failing health probes because the probes are timing out before the response arrives. The infrastructure removes the instance. Traffic redistributes to fewer instances. Those instances experience higher load. They start timing out on health probes too. A load spike that the system could have handled becomes a cascading failure that the health check configuration turned into an outage.
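The arithmetic behind that cascade can be sketched with a toy model. It assumes identical instances, evenly balanced traffic, one instance removed per probe round, and no recovery. `simulate_cascade` and its parameters are hypothetical, and the numbers are invented for illustration.

```python
from typing import List


def simulate_cascade(instances: int, total_rps: float,
                     max_rps_before_probe_timeout: float) -> List[int]:
    """Return the number of in-rotation instances after each probe round."""
    history = [instances]
    while instances > 0:
        per_instance = total_rps / instances
        if per_instance <= max_rps_before_probe_timeout:
            break  # probes answer in time; rotation is stable
        # Simplification: one instance fails its probe and is pulled per round.
        instances -= 1
        history.append(instances)
    return history


# Normal load: 900 rps over 10 instances is 90 rps each, under the threshold of 100.
print(simulate_cascade(10, 900, 100))
# Spike: 1050 rps is 105 rps each, so probes start timing out and removal begins.
print(simulate_cascade(10, 1050, 100))
```

In this model, once per-instance load crosses the point where probes time out, every removal raises the load on the survivors, so the rotation never stabilises. Real systems have recovery and autoscaling that this sketch ignores.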

Set the timeout too loose and a degraded service gets the benefit of the doubt for longer than it should. A service that is genuinely struggling, taking four seconds to respond to what should be a millisecond health probe, continues to receive traffic because the probe timeout is set to ten seconds. Users experience slow responses or failures while the infrastructure treats the instance as healthy.

The health check timeout should be lower than the timeout on your actual service endpoints. A health check that takes as long as a production request to respond is either measuring the wrong things or is itself indicating that something is wrong. Either way, the infrastructure should detect it faster than the current timeout allows.

The right calibration comes from measuring actual health check response times under load, not from picking a number that feels comfortable in a staging environment where the service is the only thing running on the machine.
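One way to turn measurements into a number is sketched below, under stated assumptions: take the 99th percentile of probe latencies observed under load, add headroom, and cap the result below the request timeout. The 2x headroom and the cap at half the request timeout are illustrative choices, not recommendations, and the function names are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_probe_timeout(latencies_ms: Sequence[float],
                          request_timeout_ms: float,
                          headroom: float = 2.0) -> float:
    """p99 of observed probe latency times headroom, capped below the request timeout."""
    candidate = percentile(latencies_ms, 99) * headroom
    # Assumed cap: the probe timeout stays at or under half the request timeout.
    return min(candidate, request_timeout_ms * 0.5)


# Illustrative usage with made-up measurements of 1 ms through 100 ms.
observed_ms = [float(i) for i in range(1, 101)]
print(suggest_probe_timeout(observed_ms, request_timeout_ms=5000))
```

The inputs matter more than the formula: the latencies should come from the health endpoint itself, measured while the service is handling production-level load.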

Ask yourself: what is your health check timeout set to, and was that number chosen based on observed response times under production load, or was it a default that nobody has revisited since the service was first deployed?

Why This Keeps Happening

Health checks are written once, usually during initial deployment, usually as the last item on a checklist that already has too many items on it. The requirement is to have a health check endpoint. The endpoint gets added. The requirement is satisfied. The question of whether the endpoint accurately represents the service's ability to handle requests rarely gets asked.

Once it is shipped, the health check is trusted. It becomes part of the infrastructure that is assumed to be working. When an incident happens and the health check was reporting healthy throughout, the post-mortem notes it as a monitoring gap and moves on. The health check is not the cause of the incident. But it is the reason the incident lasted as long as it did, and that rarely makes it into the action items.

The result is a widespread pattern where the most commonly trusted signal in production monitoring is also one of the least carefully engineered pieces of the entire system. Every team believes their health checks are working. Most teams have not tested them against their actual failure modes. Most teams would struggle to answer, without looking it up, exactly what conditions their health check can and cannot detect. This is the same gap that makes systems hard to operate at 2am: the signals exist, but they are not honest enough to be useful.

That gap between assumed coverage and actual coverage is where incidents live.
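That gap can be made measurable. The sketch below lists known failure modes as data and reports which ones a given check fails to detect. The failure modes, field names, and thresholds are hypothetical placeholders. A real audit would inject these faults into a running service instead of passing dictionaries around.

```python
from typing import Callable, Dict, List

# Service state during each known failure mode. All values are invented examples.
FAILURE_MODES: Dict[str, Dict[str, float]] = {
    "db_pool_exhausted": {"pool_free": 0, "api_p99_ms": 40, "queue_lag_s": 2},
    "slow_external_api": {"pool_free": 8, "api_p99_ms": 4000, "queue_lag_s": 2},
    "consumer_lag": {"pool_free": 8, "api_p99_ms": 40, "queue_lag_s": 1200},
}

HealthCheck = Callable[[Dict[str, float]], bool]


def process_only_check(state: Dict[str, float]) -> bool:
    """The /health route that returns 200 as long as the process responds."""
    return True


def capability_check(state: Dict[str, float]) -> bool:
    """Checks capacity, with assumed thresholds of 500 ms and 60 seconds."""
    return (state["pool_free"] > 0
            and state["api_p99_ms"] < 500
            and state["queue_lag_s"] < 60)


def undetected(check: HealthCheck) -> List[str]:
    """Failure modes during which the check still reports healthy."""
    return [name for name, state in FAILURE_MODES.items() if check(state)]


print(undetected(process_only_check))
print(undetected(capability_check))
```

The useful output is the first list: every name in it is an incident the health check would sit through while reporting healthy.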

Ask yourself: when did you last deliberately test your health check against each of your service's real failure modes? Not in staging, not with the happy path, but against the specific conditions that have actually caused incidents in production.

Before the Next Deployment

Every health check going into production should be able to answer these questions clearly.

Does it distinguish between the service being alive and the service being ready to handle traffic? Does it check whether dependencies have capacity, not just connectivity? Is every evaluation live, with no caching anywhere in the path? Is the timeout calibrated against real response times under load, not against a number that felt reasonable at the time? Does it cover the actual failure modes this specific service has, the ones that show up in incident history, not the ones that seemed likely during the initial design?

If any of those answers is uncertain, the health check is not ready for production. The service may work. But the control signal telling the infrastructure what to do with that service is not reliable. And an unreliable control signal, in a system under real load, produces exactly the kind of incident that nobody can explain until they trace it back to a Tuesday afternoon when someone added a /health route, returned a 200, and moved on.

The green dashboard is not the truth. It is only as honest as the checks producing it.

Build health checks that tell the truth, especially when the truth is that something is wrong.


More from the series

Silence Is a Design Decision - A health check that lies and logs that stay silent are the same blind spot at different layers.

Caching Is Easy. Production Caching Is Not. - The same pattern as this article: something appears to be working, the dashboard is green, and the real problem is already festering.

