Caching Is Easy. Production Caching Is Not.

Part of the series — The True Code of Production Systems


The first time you add caching to a system, it feels like a superpower.

One afternoon of work. Response times drop. Database load drops. The whole system breathes easier. You ship it, you move on, and somewhere in the back of your mind you file caching under "solved problems."

That filing is the mistake.

Because caching in production is not one decision. It is ten decisions, and most teams only consciously make one of them: the performance one. The other nine happen by default, by accident, or not at all. And defaults in production have a way of becoming incidents.

This article is about all ten. But before we get into them, let us look at a system where one of those defaults caused a real problem.


A Booking System That Did Everything Right. Almost.

A platform handles seat reservations for corporate training workshops. Companies book seats for their employees. On a normal day the system serves around two to three hundred requests per minute. The engineering team is small but experienced. They built things carefully.

Workshop availability data, which shows how many seats remain for each session, is fetched from the database and cached in Redis with a TTL of sixty seconds. The reasoning behind this was sound: availability changes only when someone books or cancels, which does not happen constantly. Caching it for a minute seemed perfectly reasonable, and for months it worked exactly as designed.

Then a well-known instructor announced a new batch of workshops on LinkedIn. The post got shared widely. Within minutes, several hundred users landed on the platform simultaneously to check availability and book seats.

The cached availability keys for those workshops had expired seconds before the spike hit. Every one of those hundreds of requests checked the cache, found a miss, and went directly to the database. The database, which had been handling twenty to thirty direct queries per minute on a normal day, received several hundred simultaneous queries in the span of a few seconds. Connection pool exhausted. Query times climbed from milliseconds to seconds. The application started timing out. Users saw errors. Some refreshed, which made it worse. The platform was effectively down for four minutes during the highest-traffic window it had ever seen.

The cache was there. Redis was running fine. The TTL was set. Everything was configured.

Nobody had thought about what happens when a popular key expires at exactly the wrong moment.

We will come back to this system after the ten points. By then you will know exactly what went wrong and what a simple fix would have looked like.


What Most Developers Think Caching Is

Cache the expensive query. Set a TTL. Use Redis. Done.

That mental model is not wrong. It is just incomplete. It describes caching as a performance tool, which it is. But in production, every caching decision is simultaneously three other things. It is a consistency decision, because data served from cache is data that may no longer reflect reality and you need to have an opinion about how much that matters. It is a reliability decision, because a cache that behaves unexpectedly under load can damage the very system it was meant to protect. And it is a cost decision, because the wrong caching setup charges you quietly, consistently, and across more than one line item on your cloud bill.

Most developers ship caching thinking only about performance. The other three dimensions show up later, usually at inconvenient moments, usually pointing back to a decision that was never consciously made.


The Ten Things Production Caching Actually Requires


1. Your Caching Pattern Is a Choice. Make It Deliberately.

Most developers use what is called Cache Aside without ever knowing they made a choice. The code checks the cache, finds a miss, goes to the database, stores the result, and returns it. That is Cache Aside. It is the most common pattern. It works. But it is one of four, and each one behaves differently in production in ways that matter.

Cache Aside puts the application completely in charge. You decide when to read from cache and when to write to it. This gives you flexibility, but it also means every invalidation is your responsibility. Miss one code path that updates the underlying data without clearing the cache, and you silently serve stale data. No error. No alert. Just users seeing something that is no longer true.
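As a point of reference, Cache Aside in its simplest form looks like this. This is a minimal Python sketch: the plain dict stands in for Redis, and the key names, TTL, and fetch function are illustrative, not from the booking system itself.

```python
import time

# In-memory stand-in for Redis: key -> (value, expires_at).
# In a real system this would be a Redis client and a database call.
cache = {}
TTL_SECONDS = 60

def fetch_from_db(workshop_id):
    # Placeholder for the real database query.
    return {"workshop_id": workshop_id, "seats_left": 12}

def get_availability(workshop_id):
    key = f"availability:{workshop_id}"
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                          # cache hit
    value = fetch_from_db(workshop_id)           # cache miss: go to the database
    cache[key] = (value, time.time() + TTL_SECONDS)
    return value
```

Note that the application owns every step: the check, the fallback, the write back. That ownership is exactly what makes invalidation your problem in this pattern.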

Read Through moves that responsibility elsewhere. The cache itself fetches from the database on a miss, so your application code only ever talks to one layer. This keeps things cleaner, but it creates a cold start problem. Every fresh deployment begins with an empty cache, and until it warms up, your database absorbs the full weight of traffic it is not used to handling alone.

Write Through writes to both the cache and the database at the same time, on every write. Your cache is always in sync with your database, which is a good property to have. The price is write latency. Every write operation now needs to complete in two places before it can return to the caller.

Write Behind is the aggressive option. Writes go to the cache immediately and the database gets updated asynchronously in the background. Writes are very fast. But if the cache node goes down before the async write completes, that data is gone. This is not a hypothetical edge case. It is a real failure mode, and unless you have consciously decided that some data loss is acceptable in exchange for write speed, this pattern is not the right one.

Before you deploy, ask: What is my consistency requirement? Can users tolerate stale data, and if so, for how long? Which pattern actually matches that requirement?

2. Cache Invalidation: Why the Joke Is Not Actually a Joke

There is a saying that has been repeated in software circles for so long it has become background noise: the two hardest things in computer science are cache invalidation and naming things. Most people chuckle and move on.

They should sit with it longer.

TTL-based invalidation is what most systems use. The key expires after a fixed duration and gets rebuilt on the next request. Simple, easy to reason about, requires no coordination between services. The downside is that TTL is a blunt instrument. Set it too long and users interact with data that no longer reflects reality. Set it too short and you repeatedly hammer the database, eliminating much of the benefit you were trying to get.

Event-based invalidation is more precise. When the underlying data changes, you immediately delete or update the cache key. Fresh data is served from the very next request. The challenge is coverage. You must ensure that every single code path in the system that can modify the data also triggers the invalidation. If you have five endpoints that update a user's profile and you handle only four of them, you have a stale data bug that will appear random and will take a long time to trace.
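The coverage requirement is easiest to see in code. A minimal sketch of event-based invalidation, with dicts standing in for Redis and the database and all names illustrative:

```python
# Event-based invalidation sketch: every write path that modifies the data
# must also clear the corresponding cache key.
cache = {}
db = {}

def update_profile(user_id, new_profile):
    db[user_id] = new_profile                  # write to the source of truth
    cache.pop(f"profile:{user_id}", None)      # then invalidate; a write path
                                               # that skips this line serves stale data

def get_profile(user_id):
    key = f"profile:{user_id}"
    if key not in cache:
        cache[key] = db[user_id]               # rebuild on the next read
    return cache[key]
```

The fragile part is not this function. It is every other place in the codebase that writes to `db` without going through it.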

The situation that quietly destroys production systems is mixing both approaches across multiple services with no shared strategy. Service A invalidates via TTL. Service B invalidates via events. Service C was written by a contractor six months ago and nobody is quite sure what it does. The cache becomes state that no single person on the team can fully reason about. That is not a theoretical risk. That is how most medium-sized systems actually look after two or three years of growth.

Ask yourself: Who owns cache invalidation in my system? Is there an actual strategy, or is each service doing its own thing independently?

3. The Cache Stampede: When Your Protection Collapses All at Once

This one catches even experienced teams off guard.

A popular cache key expires. At that exact moment, your system is handling high traffic. One thousand requests check the cache. All one thousand see a miss. All one thousand go directly to the database to fetch the data and rebuild the cache themselves. Your database, which the cache was specifically there to protect, suddenly absorbs a spike it was never provisioned to handle alone.

This is a cache stampede, also called a thundering herd. The irony is that the more effective your cache, the worse the stampede when it fails. High traffic systems with good hit rates put almost no direct load on the database during normal operation. So when the stampede hits, the database is completely unprepared for the volume.

Three ways to protect against it:

Mutex or locking allows only one request to rebuild a cache key at a time. Every other request waits. It prevents the database spike but introduces a different risk: if the rebuild is slow, you have a growing queue of waiting requests.

Probabilistic early expiration is more elegant. Before the TTL actually expires, the system starts refreshing the key using a probability function based on remaining TTL and rebuild cost. The closer to expiry, the higher the probability of an early refresh. Hot keys effectively never go fully cold.

Background refresh takes a different approach entirely. A dedicated worker keeps popular keys warm by refreshing them proactively before they expire. The application never experiences a true miss on these keys.
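Of the three, probabilistic early expiration is the least intuitive, so here is a minimal sketch of the idea (often associated with the "XFetch" technique). The parameters are illustrative: `delta` is the observed cost of rebuilding the key, and `beta` above 1 makes early refreshes more eager.

```python
import math
import random
import time

def should_refresh_early(expires_at, delta, beta=1.0):
    # -log(uniform) is an exponential random draw; scaled by the rebuild
    # cost, it pulls the effective expiry earlier for a small fraction of
    # requests. The closer the real expiry, the more requests refresh early,
    # so one of them rebuilds the key before the herd can pile up.
    return time.time() - delta * beta * math.log(random.random()) >= expires_at
```

A request that gets `True` rebuilds the key and resets the TTL; everyone else keeps serving the still-valid cached value. No lock, no queue, no cold key.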

Ask yourself: What is the peak concurrent traffic on my most accessed cache key? What happens to my database if that key expires right now, at this exact traffic level?

4. Some Things Should Never Be Cached

Knowing what to cache gets most of the attention. Knowing what not to cache is equally important and almost never discussed.

The clearest category is anything transactional or financial. A user's account balance, an order status, a payment confirmation. Think about what it means for a user to see a value that is thirty seconds stale. If they check their balance, see a number that was accurate half a minute ago, and make a financial decision based on it, you have a problem that no performance gain justifies. The rule is simple: if stale data can cause a user to take a wrong action with real consequences, it should not be cached.

The subtler category is highly personalised responses. The risk here is not performance. The risk is that if your cache key does not capture every dimension that makes a response unique to a specific user, you can serve one user's data to a completely different user. Their ID, their role, their tenant, their locale, their feature flags, all of it needs to be part of the key. If any dimension is missing, you have not just a performance problem or a consistency problem. You have a data exposure incident waiting for the right traffic pattern to surface it. This has happened at companies of every size, and the incident report always traces back to a cache key that was not specific enough.

Then there is legally or contractually sensitive content. Terms and conditions, regulated pricing, compliance documentation. Serving an outdated version of any of these is not just a user experience problem. Depending on the industry, it can carry legal weight.

Ask yourself: If this cached value is served sixty seconds after it was written, what is the worst realistic outcome for the user receiving it?

5. Your Eviction Policy Is a Decision, Not a Default

Every cache has a memory ceiling. When it fills up, something has to be removed to make space for new data. The question is what gets removed, and whether that was a deliberate engineering choice or something that just happened because nobody changed the default.

In Redis, the default eviction policy is noeviction. This means when memory is full, Redis stops accepting writes and returns errors to the caller. That is almost certainly not the behaviour you want in a production system under load. But many teams discover this only when they are already in an incident, staring at errors they have never seen before, trying to understand why the application suddenly stopped working.

There are several eviction strategies worth understanding before you choose one. The most commonly used is LRU, which stands for Least Recently Used. It removes the key that has not been accessed for the longest time. This works well in most systems because recent access is generally a good signal that a key will be accessed again soon. LFU, or Least Frequently Used, removes the key with the lowest total access count over time. This suits workloads where access frequency over the long term is a better signal than recency. TTL-based eviction removes the key closest to its natural expiry, which protects longer-lived data from being displaced by short-lived data that arrived at the wrong moment.
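In Redis, making that choice explicit takes two lines of configuration. A sketch using the standard `redis.conf` directives; the memory figure is illustrative, and `allkeys-lru` is only the right value if LRU actually matches your access pattern:

```
# redis.conf: cap memory and make the eviction policy explicit, instead of
# leaving the default (noeviction, which rejects writes when memory is full).
maxmemory 2gb
maxmemory-policy allkeys-lru
```

Other standard values include `allkeys-lfu`, `volatile-lru`, and `volatile-ttl`; the `volatile-*` variants only consider keys that have a TTL set.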

The right policy depends on how your system actually accesses data. The wrong policy, or the default left in place because nobody thought to change it, means your cache is making eviction decisions thousands of times per second based on an assumption that has nothing to do with your specific workload.

Ask yourself: Have you explicitly configured your eviction policy? When your cache fills up at peak load, what should be protected and what should go?

6. The Cold Start Problem Nobody Prepares For

You deploy a new version of your application. The new instance comes up with a completely empty cache. For the first several minutes of its life, every request is a miss. Every request goes directly to the database.

In a low-traffic system, this is barely noticeable. In a high-traffic system, or one with a database already operating near capacity, those first few minutes can look exactly like an incident. Monitoring alerts fire. Someone starts investigating. By the time they trace it to the deployment, the cache has warmed up and traffic has stabilised. The post-mortem notes it as "transient" and the team moves on.

Until the next deployment.

Three approaches prevent this. Cache warming on startup means pre-populating your most accessed keys before the new instance starts taking live traffic. To do this effectively you need to know your hot keys, which your observability setup should already be surfacing. Gradual traffic shifting avoids the all-or-nothing switchover: old instances keep serving traffic with their warm caches intact while new instances slowly build up their own state before absorbing the full load. Sticky sessions during rollout route users to consistent instances temporarily, which limits how many cold instances are simultaneously exposed to real traffic.
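The first approach is the simplest to wire in. A minimal warming sketch, where the dict stands in for the cache and `hot_keys` and `fetch_from_db` are illustrative stand-ins for your actual hot-key list and query:

```python
# Cache warming sketch: before the instance takes live traffic, pre-populate
# the keys your metrics say are hottest.
cache = {}

def fetch_from_db(key):
    # Placeholder for the real query behind each hot key.
    return f"value-for-{key}"

def warm_cache(hot_keys):
    for key in hot_keys:
        cache[key] = fetch_from_db(key)

# Run during startup, before health checks report the instance as ready,
# so the first live request already finds a warm cache.
warm_cache(["availability:workshop-1", "availability:workshop-2"])
```

The important design choice is the ordering: warming must finish before the load balancer starts sending the instance real traffic, or you have only shortened the cold window, not closed it.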

Ask yourself: What does your system actually look like in the five minutes immediately after a fresh deployment? Have you ever deliberately tested it?

7. Distributed Caching Is Not Just Single-Node Caching at Bigger Scale

When you move from a single cache server to a distributed cache cluster, the rules change in ways that are easy to miss.

Consider what happens during a write. Your application updates cache node 1. Replication to node 2 is asynchronous and has not completed yet. Another request, routed to node 2, reads that key and gets the old value. Two users, the same request, nearly the same moment, different responses.

This is not a malfunction. It is the expected behaviour of an eventually consistent distributed system. The problem surfaces only when the application is designed assuming strong consistency and the cache is delivering eventual consistency. That mismatch does not produce errors. It produces silent incorrectness, which is harder to find, harder to reproduce, and harder to explain to a stakeholder than a clean exception with a stack trace.

Redis Cluster uses asynchronous replication. Under normal conditions, replication lag is milliseconds and practically invisible. But in failure scenarios, a node going down, a network partition, a failover, writes that were acknowledged can be lost before they propagate. How your application behaves in those moments is a question worth answering before you are in the middle of one, not during it.

Ask yourself: Has your application been designed knowing that cache reads across nodes may not always be consistent? What actually happens to your users if they are not?

8. Security Gaps in Caching Are Invisible Until They Are Not

Security is the least discussed dimension of caching and the one with the most damaging failure mode.

Here is how it goes wrong. You cache a response that contains data belonging to a specific user. Your cache key is derived from the request: the endpoint, the query parameters, maybe a session attribute. A second user sends a request that generates the same cache key. They receive the first user's cached response. Their personal data, their account details, their private information, served silently to someone who should never have seen it.

This is a data breach that produces no exception, no error log, and no anomaly in your performance metrics. The cache is working exactly as designed. The design is the problem.

The fix requires rigorous cache key scoping. Every dimension that makes a response unique to a specific user must be part of the key: user ID, tenant ID, permission level, role, locale, feature flags. Leaving any of these out is not a minor oversight you can patch quietly in the next sprint. It is a live security incident waiting for the right traffic pattern to reveal it.
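In code, that scoping discipline is a single key-building function that every cached response goes through. A sketch, with illustrative field names for the user context:

```python
# Cache key scoping sketch: fold every dimension that makes a response
# unique to a user into the key. Omitting any one of them risks serving
# one user's response to another.
def response_cache_key(endpoint, user):
    return ":".join([
        endpoint,
        str(user["id"]),                           # who is asking
        user["tenant"],                            # whose data it is
        user["role"],                              # what they may see
        user["locale"],                            # how it is rendered
        ",".join(sorted(user["feature_flags"])),   # which variant they get
    ])
```

Centralising key construction also gives you one place to audit when a new dimension of personalisation is added, instead of a scavenger hunt across every endpoint.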

The second concern is what lives in your cache at rest. Session tokens, access tokens, personally identifiable information embedded in cached API responses. Most teams apply strict access controls to their databases, with audit logs, restricted credentials, and security reviews. Not all of them apply the same rigour to their cache infrastructure. A cache that is overly permissive, inadequately monitored, or not included in your threat model is a data exposure risk that your database equivalent would never be allowed to carry.

Ask yourself: Are all your cache keys scoped to the precise context of the user they serve? If your cache infrastructure were accessed by someone who should not have access, what would they find?

9. If You Are Not Measuring Your Cache, You Do Not Know If It Is Working

This is the simplest point in the article. It is also the one most consistently ignored.

A cache you cannot observe is a cache that is either working fine or silently failing, and you have no way to tell which. Most teams discover their cache has a problem the same way they discover most production problems: through an incident that has already started.

Three numbers tell you almost everything about your cache's health:

Hit rate is the percentage of requests served directly from cache. A high, stable hit rate means the cache is doing its job. A hit rate that is slowly declining over days or weeks is a signal that something has changed: data volatility has increased, TTLs have drifted out of alignment with access patterns, or a deployment changed behaviour somewhere upstream. Without this number, you will not see the decline until it has already affected performance.

Miss rate is the inverse: how often requests fall through to the database. A sudden spike in miss rate is a symptom. It can mean a stampede is in progress, an invalidation pipeline has broken, or a deployment started cold. The cause still needs to be found. But without the metric, you cannot even see that something has changed.

Eviction rate tells you whether your cache is sized correctly for the data you are asking it to hold. A rising eviction rate means your working set is larger than your allocated cache memory. Data is being pushed out before it can be reused, your hit rate will follow it downward, and your database load will follow that upward. The eviction rate is the early warning that precedes both.

Together, these three numbers tell a continuous story. Without them, you are managing critical infrastructure entirely on faith.
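If your cache layer does not expose these numbers directly, instrumenting them is a small amount of code. A minimal sketch, where the dict-based cache and the tiny size limit stand in for a real cache and its memory ceiling:

```python
# Cache instrumentation sketch: count hits, misses, and evictions so the
# three ratios above are observable.
stats = {"hits": 0, "misses": 0, "evictions": 0}
cache = {}
MAX_KEYS = 2   # deliberately tiny, to make eviction visible

def cache_get(key, loader):
    if key in cache:
        stats["hits"] += 1
        return cache[key]
    stats["misses"] += 1
    if len(cache) >= MAX_KEYS:
        cache.pop(next(iter(cache)))   # evict oldest-inserted (FIFO stand-in)
        stats["evictions"] += 1
    cache[key] = loader(key)
    return cache[key]

def hit_rate():
    total = stats["hits"] + stats["misses"]
    return stats["hits"] / total if total else 0.0
```

In a real deployment these counters would be emitted to your metrics system rather than held in a dict, but the shape is the same: every read path increments exactly one of hit or miss, and every forced removal increments eviction.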

Ask yourself: Can you pull up a live view of your cache hit rate, miss rate, and eviction rate right now? If not, that is the first thing to fix before anything else in this list.

10. The Cost Is Real, and It Compounds Quietly

Cloud infrastructure has a way of billing you for decisions made months ago by people who are sometimes no longer on the team.

Caching is one of those decisions. Size your cache too small and the effects cascade: high eviction rates reduce your hit rate, a lower hit rate pushes more load to the database, more database load requires more compute, more compute costs more money. You end up paying across multiple services because one component was under-provisioned. Size your cache too large and you pay for memory that sits idle. Managed Redis on any major cloud provider is not free, and idle capacity is billed at the same rate as active capacity.

The right size comes from understanding your working set, which is the total volume of data your application actually reads within a given time window. If your working set is 15GB and your cache is 4GB, you are not caching 15GB of data. You are repeatedly evicting and re-fetching 11GB of it, paying for the database round trips on every cycle.

The other cost that accumulates quietly is data transfer. If your application instances and your cache cluster live in different availability zones, you pay for cross-zone traffic on every cache read. On a high-traffic system with a high hit rate, that is an enormous number of reads. The per-request cost is small. The monthly total is not, and it tends to surprise people when someone finally looks at the bill in detail.

Ask yourself: Have you sized your cache from working set analysis or from a number someone estimated at the start of the project? Do you know what your cross-zone cache traffic costs per month?

Back to the Booking System

Remember the platform that went down for four minutes? The cache was there. Redis was running. The TTL was set. The engineering team had done everything the basic mental model asks for.

What they had not done was think about point 3: the stampede.

The availability keys for those popular workshops all had the same sixty-second TTL, set at roughly the same time when the workshops were first published. So they all expired together. When the traffic spike hit, every request found a cold cache simultaneously and went straight to the database. The protection collapsed at the exact moment it was needed most.

The fix was not complicated. A background worker refreshing availability keys for popular workshops every forty-five seconds would have kept those keys warm through the entire spike. The database would have seen normal traffic. Users would have seen normal response times. The four minutes of downtime would not have happened.
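What that worker might look like, sketched minimally: the dict stands in for Redis, the key names and fetch function are illustrative, and the 45-second interval mirrors the example above.

```python
import threading

# Background refresh sketch: a worker rebuilds hot keys on a fixed interval
# shorter than the TTL, so live traffic never sees them cold.
cache = {}
stop = threading.Event()

def fetch_availability(workshop_id):
    # Placeholder for the real database query.
    return {"workshop_id": workshop_id, "seats_left": 12}

def refresh_once(hot_workshop_ids):
    for workshop_id in hot_workshop_ids:
        cache[f"availability:{workshop_id}"] = fetch_availability(workshop_id)

def refresh_loop(hot_workshop_ids, interval=45):
    # With a 60-second TTL, refreshing every 45 seconds means each key is
    # rewritten before it can ever expire.
    while not stop.is_set():
        refresh_once(hot_workshop_ids)
        stop.wait(interval)   # sleeps, but wakes immediately when stopped

# In production this runs as a daemon thread or a separate worker process:
# threading.Thread(target=refresh_loop, args=(["workshop-123"],),
#                  daemon=True).start()
```

The subtlety is the interval: it has to stay comfortably below the TTL even when the rebuild itself is slow, otherwise the key can still expire in the gap.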

One decision. Not made. Four minutes down.

That is what production caching actually looks like. Not a performance graph. A decision with a consequence.


The Thing That Ties All of This Together

Caching does not make your system faster.

Done right, it does. Done wrong, it makes your system faster right up until the moment it does not. And when it fails, it tends to fail suddenly, in ways that are difficult to trace back to a decision made quietly, months earlier, on an ordinary afternoon.

Every point in this article is a decision. Not a best practice to bookmark, not a suggestion for the next sprint. A decision with consequences that land in production, on your watch, in front of users.

The engineers who build systems that hold up under real pressure are not necessarily smarter. They are more deliberate. They treat each of these ten things as a conscious choice rather than something that gets handled by default.

Make the choices. Write them down. Revisit them before you ship.


Production Ready Checklist

Go through this before anything involving caching reaches production. Not as a formality. As a genuine engineering checkpoint.

  • [ ] Have I consciously chosen a caching pattern and do I understand its consistency trade-offs?
  • [ ] Do I have a defined invalidation strategy with a clear owner, clear triggers, and handling for silent failures?
  • [ ] Have I protected my hottest cache keys against a stampede event?
  • [ ] Have I audited what I am caching and confirmed none of it is transactional, financial, or dangerous when stale?
  • [ ] Have I explicitly configured my eviction policy rather than accepting the default?
  • [ ] Have I planned and actually tested what happens in the first five minutes after a cold deployment?
  • [ ] Do I understand my cache cluster's replication and consistency model, and has my application been designed with that in mind?
  • [ ] Are my cache keys scoped precisely enough that no response can ever be served to the wrong user?
  • [ ] Do I have live monitoring for hit rate, miss rate, and eviction rate?
  • [ ] Have I sized my cache from a working set analysis and not from a rough estimate?

The True Code of Production Systems is a series on The True Code. Each post covers one production-critical topic, stack-agnostic, with enough depth to actually change how you think about it.