Tuning a proxy cache for hit rate, latency, and origin load
Turning on proxy caching is a one-click decision. Tuning it well is where the leverage shows up. A 60% hit rate cuts your origin load by more than half: the origin sees 40% of requests. An 85% hit rate cuts it to about a seventh: the origin sees 15%. The difference between those two numbers is almost never about the cache itself; it's about how you sized it, what you keyed it on, and where you placed it relative to your other ingress policies.
This is a walkthrough of how to think about each knob, with worked numbers.
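The arithmetic behind those two figures is short enough to keep on hand. This is a minimal sketch, assuming every request costs the origin about the same; the function name is illustrative.

```python
def origin_load_reduction(hit_rate: float) -> float:
    """Factor by which origin request volume drops at a given cache hit rate."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    # Only misses reach the origin, so it sees (1 - hit_rate) of the traffic.
    return 1.0 / (1.0 - hit_rate)


for rate in (0.60, 0.85, 0.95):
    factor = origin_load_reduction(rate)
    print(f"{rate:.0%} hit rate -> origin load cut {factor:.1f}x")
```

The curve is steep near the top: each extra point of hit rate is worth more than the last, which is why tuning pays off.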
Step 1: Decide whether to cache at all
The fit test is short:
- The same URL is requested by many different users.
- The response either doesn't change for the duration of a TTL you can tolerate, or you can tag/version it so changes invalidate cleanly.
- Producing the response costs noticeably more than serving a cached copy (CPU, downstream calls, paid third-party requests, LLM tokens).
If all three hold, cache. If any of them fails, the cache either won't get used or will serve the wrong thing.
What doesn't fit:
- Per-user content — the cache key explodes to one entry per user and the hit rate collapses. Reach for application-level memcached instead.
- Live counters, leaderboards, real-time inventory — even a 5-second TTL is wrong some of the time.
- Write endpoints: POST, PUT, PATCH, DELETE aren't cached.
Step 2: Size the cache to your working set
Pick total cache size per region. The rule of thumb:
working_set = avg_response_size_kb × unique_cacheable_urls
target_size = working_set × 1.5 # headroom for metadata + churn
Worked examples:
- Public read API, 5 KB JSON responses, 50,000 unique URLs → 250 MB working set → start at 512 MB.
- Listing endpoints, 20 KB JSON, 200,000 URLs → 4 GB working set → start at 6 GB.
- Thumbnail proxy, 80 KB images, 30,000 URLs → 2.4 GB → start at 4 GB.
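The rule of thumb and the three worked examples can be reproduced with a few lines. This sketch assumes decimal units (1 MB = 1,000 KB); the function name and the dictionary of examples are illustrative.

```python
def size_cache(avg_response_kb: float, unique_urls: int, headroom: float = 1.5):
    """Return (working_set_mb, target_size_mb), using 1 MB = 1000 KB."""
    working_set_mb = avg_response_kb * unique_urls / 1000
    # Headroom covers metadata and churn, per the rule of thumb.
    return working_set_mb, working_set_mb * headroom


examples = {
    "public read API": (5, 50_000),
    "listing endpoints": (20, 200_000),
    "thumbnail proxy": (80, 30_000),
}
for name, (kb, urls) in examples.items():
    working, target = size_cache(kb, urls)
    print(f"{name}: working set {working:,.0f} MB, target {target:,.0f} MB")
```

The computed targets (375 MB, 6,000 MB, 3,600 MB) are then rounded up to a convenient size, which is where the 512 MB and 4 GB starting points come from.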
If you can't estimate it, start at 1 GB, run for a day, look at the hit rate, and resize. Resizing is non-disruptive — the cluster picks up the new size on the next deployment cycle.
The mistake to avoid is sizing for "everything we serve." You want the cache to hold the things that get fetched repeatedly, not the long tail of one-off requests. A 1 GB cache that fits your top 1,000 endpoints will outperform an 8 GB cache that thrashes trying to hold the bottom 100,000.
Step 3: Pick TTLs to match how fast your data actually changes
Two TTLs to set:
- Cache TTL (default 300 s) — how long the entry is fresh.
- Storage TTL (optional, must be ≥ cache TTL) — how long it sticks around after going stale.
The split matters. If you set cache TTL to 300 s and storage TTL to 3600 s, then after 5 minutes the entry is "stale" — but if your container is briefly unreachable, the gateway can still serve the stale copy instead of failing the request. That's a free reliability win on top of the cache hit rate.
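One way to picture the two TTLs is as a three-state lifecycle. The sketch below is a model, not the gateway's implementation: the state names, the boundary handling, and the assumption that an entry with no storage TTL is dropped as soon as it stops being fresh are all illustrative.

```python
from typing import Optional


def entry_state(age_s: float, cache_ttl_s: float = 300,
                storage_ttl_s: Optional[float] = None) -> str:
    """Classify a cached entry as 'fresh', 'stale', or 'expired' by its age."""
    if storage_ttl_s is None:
        storage_ttl_s = cache_ttl_s  # assumption: no stale window without a storage TTL
    if storage_ttl_s < cache_ttl_s:
        raise ValueError("storage TTL must be >= cache TTL")
    if age_s < cache_ttl_s:
        return "fresh"
    if age_s < storage_ttl_s:
        return "stale"  # kept around as a fallback copy
    return "expired"


def serve_from_cache(age_s: float, origin_reachable: bool,
                     cache_ttl_s: float = 300, storage_ttl_s: float = 3600) -> bool:
    """True if the cached copy is what the client gets for this request."""
    state = entry_state(age_s, cache_ttl_s, storage_ttl_s)
    # Fresh entries are always served; stale ones only when the origin is down.
    return state == "fresh" or (state == "stale" and not origin_reachable)
```

With a 300 s cache TTL and a 3600 s storage TTL, an entry that is ten minutes old is stale: it is refetched when the origin is healthy and served as-is when the origin is unreachable.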
Some rough starting points:
| Content type | Cache TTL | Notes |
| --- | --- | --- |
| Reference data (catalogue, region list, plan tiers) | 1800–3600 s | Changes a few times a day at most |
| Public read API responses | 60–300 s | Tight enough that users don't see day-old data |
| Search / listing pages | 30–120 s | Higher write rate, tighter TTL |
| Static-ish assets served via your container | 3600+ s | If you control the version in the URL |
If your container already emits Cache-Control headers, leave Honour Cache-Control on (the default) — your responses keep authority over their own TTL. Turn it off only if you want the platform TTL to win unconditionally for the whole container.
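To reason about which TTL wins, it can help to model the decision. The sketch below is an assumption about how such a switch might behave, not the platform's documented logic: it follows standard HTTP semantics for shared caches (`no-store` and `private` prevent storage, `s-maxage` takes precedence over `max-age`) and ignores other directives such as `no-cache`. The function and parameter names are hypothetical.

```python
import re
from typing import Optional


def effective_ttl(cache_control: Optional[str], platform_ttl_s: int = 300,
                  honour_cache_control: bool = True) -> int:
    """Return the TTL in seconds to apply; 0 means 'do not cache'."""
    if not honour_cache_control or not cache_control:
        return platform_ttl_s  # platform TTL wins unconditionally
    directives = [d.strip().lower() for d in cache_control.split(",")]
    if "no-store" in directives or "private" in directives:
        return 0
    # s-maxage targets shared caches, so check it before max-age.
    for name in ("s-maxage", "max-age"):
        for directive in directives:
            match = re.fullmatch(rf"{name}=(\d+)", directive)
            if match:
                return int(match.group(1))
    return platform_ttl_s  # header present but carries no TTL


print(effective_ttl("public, max-age=60"))
print(effective_ttl("max-age=60", honour_cache_control=False))
```

Under these assumptions the first call yields the response's own 60 s and the second falls back to the 300 s platform default.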
Step 4: Tighten the cache key
The default key is method + path + query. That works for a lot of endpoints. It breaks the moment your responses depend on something else: the user's language, the tenant header, a partial filter in a JSON body.
Options for extending the key:
- Headers: Accept-Language is the classic. A tenant header (X-Tenant, X-Org) is also common. Each extra header dimension multiplies your entry count, so keep this list short.
- Query-param allow-list: by default every query param is part of the key, which means tracking parameters (utm_*, gclid) fragment your cache. Switch to an allow-list of params that actually change the response (page, limit, category) and the cache consolidates.
- JSON body fields: for POST-style search endpoints, you can name specific fields from the JSON body that should be part of the key. Everything else is ignored.
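The three options above can be sketched as a single key-building function. This is an illustration of the idea, not the platform's key format: the function, its parameters, and the choice of a SHA-256 digest are all assumptions, and the body is assumed to be a JSON object.

```python
import hashlib
import json
from urllib.parse import parse_qsl, urlsplit


def cache_key(method, url, headers=None, body=None, *,
              vary_headers=(), query_allow=None, body_fields=()):
    """Build a deterministic key from method, path, and the chosen dimensions."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    if query_allow is not None:
        # Drop tracking params and anything else not on the allow-list.
        params = [(k, v) for k, v in params if k in query_allow]
    params.sort()  # parameter order must not change the key

    lowered = {k.lower(): v for k, v in (headers or {}).items()}
    header_part = [(h, lowered.get(h, "")) for h in sorted(x.lower() for x in vary_headers)]

    body_part = []
    if body and body_fields:
        doc = json.loads(body)
        # Only the named fields count; everything else in the body is ignored.
        body_part = [(f, doc.get(f)) for f in sorted(body_fields)]

    material = json.dumps([method.upper(), parts.path, params, header_part, body_part])
    return hashlib.sha256(material.encode()).hexdigest()


a = cache_key("GET", "/items?page=2&utm_source=mail", query_allow={"page"})
b = cache_key("GET", "/items?utm_source=ad&page=2", query_allow={"page"})
print(a == b)
```

With the allow-list in place, the two URLs collapse to one entry even though their tracking parameters differ.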
The rule: every dimension on the cache key multiplies your entry count by the cardinality of that dimension, meaning the number of distinct values it takes. A two-language API with five tenants and an allow-listed query param that takes three values is 2 × 5 × 3 = 30× whatever your base unique-URL count is. Plan the entry count, not just the working-set size.
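The multiplication is easy to script when you are weighing whether to add a dimension. A minimal sketch, with illustrative names and a base URL count borrowed from the sizing examples; the result is an upper bound, since not every combination is necessarily requested.

```python
from math import prod


def estimated_entries(base_urls: int, cardinalities: dict) -> int:
    """Upper bound on cache entries: base URLs times each dimension's cardinality."""
    return base_urls * prod(cardinalities.values())


dimensions = {"Accept-Language": 2, "X-Tenant": 5, "query_variants": 3}
print(estimated_entries(50_000, dimensions))
```

Fifty thousand base URLs become 1.5 million potential entries, so multiply the working set by the same factor before deciding the extra dimension is affordable.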
Step 5: Pick a placement relative to rate limiting
You have two choices for where the cache sits in the request path:
- Before rate limit: cache hits don't consume your rate-limit budget. A scraper hitting the same URL 10,000 times in a minute gets 10,000 cached responses and your real users keep their full rate-limit budget. This is the right placement for public read endpoints.
- After rate limit — every request, hit or miss, counts. This is the right placement when your rate limit exists to enforce per-user fairness on an authenticated API, not to protect the origin.
If you're not sure which you want, think about what the rate limit is for. Protecting the origin → cache before. Enforcing fairness → cache after.
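A toy simulation makes the difference between the two placements concrete. It is deliberately simplified and every detail is an assumption: one shared budget for a single rate-limit window, a cache with no TTL expiry, and illustrative names throughout.

```python
def simulate(requests, budget, cache_before_limit):
    """Return (served, rejected, origin_calls) for one rate-limit window."""
    cache = set()  # URLs already cached (no TTL expiry in this sketch)
    used = served = rejected = origin_calls = 0
    for url in requests:
        hit = url in cache
        if hit and cache_before_limit:
            served += 1  # answered before the limiter ever sees it
            continue
        if used >= budget:
            rejected += 1  # budget for this window is spent
            continue
        used += 1  # request passes the limiter and spends budget
        if not hit:
            origin_calls += 1
            cache.add(url)
        served += 1
    return served, rejected, origin_calls


scraper = ["/items/42"] * 10_000 + ["/items/7"]
print(simulate(scraper, budget=100, cache_before_limit=True))
print(simulate(scraper, budget=100, cache_before_limit=False))
```

With the cache in front, the scraper spends two units of budget and the later request for a different URL still gets through. With the cache behind, the first 100 requests exhaust the window and everything after is rejected, which is exactly what you want when the limit is there for fairness.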
Step 6: Watch the hit rate and iterate
The cache exposes a hit rate per region. The first interesting question is whether the rate is flat, climbing, or oscillating.
- Climbing for a few hours and then plateauing — the cache is warming. The plateau is your steady-state hit rate.
- Plateau is too low (under 50%) — usually one of two things: the cache is too small for the working set (size up), or the key has too many dimensions (consolidate it).
- Oscillating up and down hour-to-hour — TTL is too short for the access pattern, or your working set is genuinely changing on that cycle. Try a longer cache TTL, possibly with storage TTL as a safety net.
- High hit rate but tail latency hasn't moved: the cache isn't in the hot path. Check that the endpoints you actually care about use cacheable methods and return cacheable status codes.
Resize, retighten the key, or move the cache placement, and watch what happens. None of these changes require an application redeploy.
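If you export the hit rate as a time series, the first three patterns above can be told apart with a crude heuristic. This is a sketch only: apart from the 50% floor, the thresholds and the choice to look at the most recent half of the samples are assumptions to tune against your own traffic.

```python
from statistics import mean


def diagnose(hit_rates, low=0.50, wobble=0.05):
    """Classify a series of hit rates (0..1, oldest first, e.g. one per hour)."""
    if len(hit_rates) < 6:
        return "not enough data"
    recent = hit_rates[len(hit_rates) // 2:]
    steps = [b - a for a, b in zip(recent, recent[1:])]
    spread = max(recent) - min(recent)
    if all(s >= 0 for s in steps) and spread > wobble:
        return "warming"  # still climbing, wait for the plateau
    if spread > wobble:
        return "oscillating"  # try a longer cache TTL
    if mean(recent) < low:
        return "plateau too low"  # size up or consolidate the key
    return "steady"


print(diagnose([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]))
print(diagnose([0.7, 0.4, 0.7, 0.4, 0.7, 0.4, 0.7, 0.4]))
```

The label only tells you which knob to reach for first; the fix still comes from resizing, retightening the key, or moving the placement.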
Three configurations worth copying
Public read API, anti-scrape focus. 1 GB cache, cache TTL 60 s, methods GET, HEAD, statuses 200, 404, vary on Accept-Language, placement before rate limit. Scrapers eat cached responses; real users keep their rate budget.
Authenticated tenant API, per-user fairness. 2 GB cache, cache TTL 300 s, vary on a tenant header (or on Authorization, at the cost of one entry per distinct credential), placement after rate limit. Cache hits still count, but a hot endpoint with a few thousand tenants doesn't melt your origin.
Catalogue / product listing. 4 GB cache, cache TTL 1800 s, storage TTL 7200 s, honour Cache-Control on, content types application/json, text/html, placement before rate limit. Long TTL, long storage tail for resilience, full Cache-Control authority for the rare endpoint that needs to invalidate quickly.
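For comparing the three setups side by side, or keeping them in version control, they can be written down as plain data. The class and field names below are hypothetical and are not the platform's configuration schema; `None` stands for "leave at the platform default" wherever a setting isn't stated above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

PLACEMENTS = ("before_rate_limit", "after_rate_limit")


@dataclass(frozen=True)
class CacheConfig:
    size_gb: float
    cache_ttl_s: int
    placement: str
    storage_ttl_s: Optional[int] = None
    vary_headers: Tuple[str, ...] = ()
    methods: Optional[Tuple[str, ...]] = None
    statuses: Optional[Tuple[int, ...]] = None
    content_types: Optional[Tuple[str, ...]] = None
    honour_cache_control: bool = True  # on by default

    def __post_init__(self):
        if self.placement not in PLACEMENTS:
            raise ValueError(f"placement must be one of {PLACEMENTS}")
        # Storage TTL is optional, but it can never be shorter than the cache TTL.
        if self.storage_ttl_s is not None and self.storage_ttl_s < self.cache_ttl_s:
            raise ValueError("storage TTL must be >= cache TTL")


PRESETS = {
    "public_read_api": CacheConfig(
        size_gb=1, cache_ttl_s=60, placement="before_rate_limit",
        methods=("GET", "HEAD"), statuses=(200, 404),
        vary_headers=("Accept-Language",)),
    "tenant_api": CacheConfig(
        size_gb=2, cache_ttl_s=300, placement="after_rate_limit",
        vary_headers=("Authorization",)),
    "catalogue": CacheConfig(
        size_gb=4, cache_ttl_s=1800, storage_ttl_s=7200,
        placement="before_rate_limit",
        content_types=("application/json", "text/html")),
}
```

The validation in `__post_init__` catches the one hard constraint stated earlier (storage TTL at least as long as cache TTL) before a config gets anywhere near a deployment.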
What the cache won't fix
A proxy cache is a great answer to a specific problem: identical requests producing identical responses, repeated at scale. It's not an answer to slow database queries on cache misses, expensive per-user work, or a container that falls over under modest load. Cache misses still go through. If 5% of your traffic is genuinely uncacheable and that 5% is what's hurting you, the cache won't help — fix the origin path first.
The hit rate is the score. Anything that moves it up is worth doing. Anything that doesn't, isn't.
Read the full reference: Proxy Caching