Gemini Context Caching: Pricing and Savings Guide
Context caching lets you store large input prefixes and reuse them across multiple API requests, reducing the cost of repeated content. This guide explains exactly how the pricing works, when caching saves money, and how it compares to alternatives like Claude's prompt caching.
What Is Context Caching?
When you use the Gemini API, every request includes input tokens that the model processes before generating a response. If you are sending the same large block of text (a system prompt, a document, a codebase) across multiple requests, you are paying full input price for that repeated content every single time.
Context caching solves this by letting you upload a large prefix once, store it on Google's servers, and reference it in subsequent requests. The cached tokens are read from storage instead of being reprocessed, which costs less than sending them as fresh input.
Common use cases include: chatbots with long system prompts, applications that analyze the same document repeatedly, code assistants working on a fixed codebase, and any workflow where the beginning of each request is identical.
How It Works in Practice
1. Upload your prefix content (minimum 32,768 tokens) to create a cache entry
2. Receive a cache ID that references your stored content
3. Include the cache ID in subsequent requests instead of the full prefix
4. Pay the reduced input price for cached tokens, plus a storage fee per hour
Context Caching Pricing
Gemini context caching has two cost components:
Storage Cost
25% of the standard input price per hour. This is charged for every hour the cache exists, regardless of whether you read from it. Think of it as a rental fee for keeping your content readily available.
Read Discount
25% off the standard input price per read. When you reference cached tokens in a request, those tokens are charged at 75% of the normal input rate. The non-cached portion of your request is still billed at full price.
Standard vs Cached Pricing Per Model
| Model | Standard Input | Cached Read | Storage / Hour | Output |
|---|---|---|---|---|
| Gemini 2.5 Pro | $1.25 | $0.9375 | $0.3125 | $10.00 |
| Gemini 2.5 Flash | $0.15 | $0.1125 | $0.0375 | $0.60 |
| Gemini 2.0 Flash | $0.10 | $0.075 | $0.025 | $0.40 |
All prices per 1M tokens. Output pricing is unaffected by caching.
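Every cached and storage price in the table derives from the standard input price, which a small helper makes explicit. This is a sketch of the pricing rules above, not an API call; the function name is our own.

```python
def caching_costs(standard_input_per_mtok: float) -> dict:
    """Derive cached-read and storage prices from the standard input price.

    Cached reads are billed at 75% of the input price (a 25% discount);
    storage costs 25% of the input price per MTok per hour.
    """
    return {
        "cached_read_per_mtok": standard_input_per_mtok * 0.75,
        "storage_per_mtok_hour": standard_input_per_mtok * 0.25,
    }

# Gemini 2.5 Flash: $0.15/MTok standard input
flash = caching_costs(0.15)
print(round(flash["cached_read_per_mtok"], 4))   # cached read price
print(round(flash["storage_per_mtok_hour"], 4))  # hourly storage price
```

Running this for $0.15 reproduces the Flash row above ($0.1125 cached read, $0.0375/hour storage); the same function applied to $1.25 reproduces the 2.5 Pro row.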
Break-Even Analysis: When Does Caching Save Money?
Context caching is not always cheaper. The storage fee means you need to make enough cached reads within each hour to offset the rental cost. Let's work through the math.
The Formula
Savings per read = standard_input - cached_read = 25% of input price
Storage cost per hour = 25% of input price per MTok
Break-even reads per hour = storage_cost / savings_per_read = 1 read per hour
The break-even point is surprisingly simple: you need just 1 cached read per hour to cover the storage fee. At exactly 1 read per hour, the savings from the discounted read equal the storage cost, so you break even. Every additional read within that hour is pure savings.
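The cancellation is easy to verify in code: because the per-read savings and the hourly storage fee are both 25% of the input price, the price and cache size drop out of the ratio entirely. A minimal sketch:

```python
def break_even_reads_per_hour(input_price_per_mtok: float, cache_mtok: float) -> float:
    """Reads per hour needed for caching savings to cover storage cost."""
    savings_per_read = 0.25 * input_price_per_mtok * cache_mtok
    storage_per_hour = 0.25 * input_price_per_mtok * cache_mtok
    return storage_per_hour / savings_per_read

# The price and cache size cancel: always exactly 1 read per hour.
print(break_even_reads_per_hour(1.25, 0.5))  # 1.0
print(break_even_reads_per_hour(0.15, 0.1))  # 1.0
```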
Real Savings Examples
Example 1: Customer Support Bot with 100K Token Knowledge Base
Using Gemini 2.5 Flash, a support bot sends the same 100K token knowledge base with every request. It handles 60 requests per hour.
Without caching: 60 reads x 0.1 MTok x $0.15 = $0.90/hour
With caching:
Cached reads: 60 x 0.1 MTok x $0.1125 = $0.675/hour
Storage: 0.1 MTok x $0.0375/hour = $0.00375/hour
Total: $0.679/hour
Savings: $0.221/hour ≈ $5.30/day ≈ $159/month, assuming round-the-clock traffic (a 24.6% reduction)
Example 2: Code Review Tool with 500K Token Codebase
Using Gemini 2.5 Pro, a code review tool analyzes PRs against a 500K token codebase. Developers make 20 review requests per hour during work hours.
Without caching: 20 reads x 0.5 MTok x $1.25 = $12.50/hour
With caching:
Cached reads: 20 x 0.5 MTok x $0.9375 = $9.375/hour
Storage: 0.5 MTok x $0.3125/hour = $0.15625/hour
Total: $9.531/hour
Savings: $2.969/hour = $23.75/day (8 work hours) ≈ $475/month (20 workdays)
Example 3: Low-Volume Research Tool (Where Caching Barely Helps)
Using Gemini 2.5 Pro, a research tool sends a 200K token document with occasional queries, about 2 per hour.
Without caching: 2 reads x 0.2 MTok x $1.25 = $0.50/hour
With caching:
Cached reads: 2 x 0.2 MTok x $0.9375 = $0.375/hour
Storage: 0.2 MTok x $0.3125/hour = $0.0625/hour
Total: $0.4375/hour
Savings: $0.0625/hour = $1.50/day (only 12.5% reduction)
At low request volumes, the storage cost eats into savings. Still cheaper, but the margin is thin.
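All three examples follow the same arithmetic, so a single helper can reproduce them. This is a sketch that assumes the entire prefix is cached; the prices come from the table above.

```python
def hourly_costs(reads_per_hour: int, cache_mtok: float, input_price: float) -> tuple:
    """Return (uncached, cached) hourly cost for a fully cached prefix."""
    uncached = reads_per_hour * cache_mtok * input_price
    cached_reads = reads_per_hour * cache_mtok * input_price * 0.75  # 25% read discount
    storage = cache_mtok * input_price * 0.25                        # 25%/hour storage fee
    return uncached, cached_reads + storage

# Example 1: 2.5 Flash, 100K-token knowledge base, 60 reads/hour
print(hourly_costs(60, 0.1, 0.15))   # matches the $0.90 vs ~$0.679 figures above
# Example 2: 2.5 Pro, 500K-token codebase, 20 reads/hour
print(hourly_costs(20, 0.5, 1.25))   # matches $12.50 vs ~$9.531
# Example 3: 2.5 Pro, 200K-token document, 2 reads/hour
print(hourly_costs(2, 0.2, 1.25))    # matches $0.50 vs $0.4375
```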
Minimum Cache Size and Constraints
Google requires a minimum of 32,768 tokens (roughly 25,000 words) to create a cache entry. Content shorter than this must be sent as regular input. This threshold ensures the caching infrastructure overhead is worthwhile relative to the content being stored.
Additional constraints to keep in mind:
- Cache entries have a configurable time-to-live (TTL). Default is 1 hour. You can set shorter or longer TTLs, but storage is billed for the full duration.
- The cached content must be a prefix. You cannot cache content that appears in the middle or end of your prompt. The cache always starts from the beginning of the input.
- Cache entries are model-specific. A cache created for Gemini 2.5 Pro cannot be used with Gemini 2.5 Flash or any other model.
- You can update the TTL of an existing cache without recreating it, which avoids reprocessing the content.
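Because storage is billed for the full TTL whether or not you ever read from the cache, the TTL you choose has a direct, predictable cost. A quick sketch, using the 2.5 Pro storage rate from the table (the function name is ours):

```python
def ttl_storage_cost(cache_mtok: float, storage_per_mtok_hour: float, ttl_hours: float) -> float:
    """Storage is billed for the whole TTL, regardless of reads."""
    return cache_mtok * storage_per_mtok_hour * ttl_hours

# A 500K-token cache on Gemini 2.5 Pro ($0.3125/MTok/hour), held for 8 hours:
print(ttl_storage_cost(0.5, 0.3125, 8))  # 1.25
```

At $1.25 for an 8-hour workday, a long TTL is cheap insurance against reprocessing a 500K-token prefix, but an idle cache with a multi-day TTL accrues the same rate with no offsetting read savings.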
When to Use Context Caching (and When Not To)
Good Candidates
- Chatbots with large, static system prompts
- Document Q&A where users query the same document many times
- Code assistants with a fixed repository context
- Any workflow with 2+ requests/hour sharing the same prefix
Poor Candidates
- Short system prompts (under 32K tokens)
- Prompts that change every request
- Low-frequency usage (less than 1 request per hour)
- One-off batch processing jobs
Gemini Context Caching vs Claude Prompt Caching
Anthropic's Claude offers a competing prompt caching feature with a fundamentally different pricing model. Understanding the differences helps you choose the right provider for your caching workload.
| Aspect | Gemini Caching | Claude Caching |
|---|---|---|
| Cache Write Cost | Standard input price (no surcharge) | 25% markup on input price |
| Cache Read Discount | 25% off input price | 90% off input price |
| Storage Cost | 25% of input price/hour | None (auto-eviction after 5 min idle) |
| Minimum Cache Size | 32,768 tokens | 1,024 tokens (Sonnet/Haiku) |
| Cache Duration | Configurable TTL (default 1 hour) | 5 minutes of idle time |
| Best For | Moderate reads, long-lived caches | Frequent reads in short bursts |
The Key Difference
Claude's 90% read discount is far more aggressive than Gemini's 25%. For workloads with many reads per hour, Claude's caching delivers dramatically higher savings. However, Claude's cache auto-evicts after 5 minutes of idle time, making it unsuitable for sporadic access patterns. Gemini's explicit TTL control and hourly storage model give you predictable costs and work better for less frequent, more spread-out usage patterns. Choose based on your actual access pattern.
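A rough hourly model makes the trade-off concrete. This is a sketch under simplifying assumptions, not a full cost model: the whole prefix is cached, Gemini's cache is written once and held via TTL (write cost amortized away), and Claude's reads arrive fast enough that the cache never idles out, so it is written once per hour. The discount and surcharge figures come from the comparison table; the function names are ours.

```python
def gemini_hourly(reads: int, cache_mtok: float, input_price: float) -> float:
    # Reads at 75% of the input price, plus the 25%/hour storage fee
    return reads * cache_mtok * input_price * 0.75 + cache_mtok * input_price * 0.25

def claude_hourly(reads: int, cache_mtok: float, input_price: float) -> float:
    # One cache write per hour at a 25% surcharge, then reads at 10% of input price
    write = cache_mtok * input_price * 1.25
    return write + reads * cache_mtok * input_price * 0.10

# Same hypothetical 0.5 MTok prefix at a $1.25/MTok input price:
for reads in (2, 20, 100):
    print(reads,
          round(gemini_hourly(reads, 0.5, 1.25), 3),
          round(claude_hourly(reads, 0.5, 1.25), 3))
```

Under these assumptions Claude pulls far ahead as read volume grows. But note the caveat baked into the model: at 2 reads per hour the reads would be roughly 30 minutes apart, Claude's cache would evict between them, and each read would require a fresh surcharged write, which is exactly the sporadic-access case where Gemini's TTL model wins.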
Frequently Asked Questions
How much does Gemini context caching save?
Context caching reduces the input token cost by 25% for cached tokens. For Gemini 2.5 Pro, input drops from $1.25 to $0.9375 per million tokens for cached reads. You also pay a storage fee of $0.3125 per MTok per hour (25% of the input price). Net savings depend on request frequency: at 10 reads per hour you save about 22.5% of your cached input costs, approaching the full 25% as frequency grows. At just 2 reads per hour, savings drop to 12.5%.
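Because both the read discount and the storage fee are 25% of the input price, the net savings fraction depends only on read frequency; the price and cache size cancel out. A minimal sketch:

```python
def savings_fraction(reads_per_hour: float) -> float:
    """Fraction of uncached input cost saved; independent of price and cache size.

    Derivation: savings = N * 0.25 * c * p - 0.25 * c * p, uncached cost = N * c * p,
    so the fraction is 0.25 * (1 - 1/N).
    """
    return 0.25 * (1 - 1 / reads_per_hour)

print(savings_fraction(2))    # 0.125  (the 12.5% low-volume case)
print(savings_fraction(60))   # approaches 0.25 at high volume
```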
What is the minimum cache size for Gemini?
The minimum cache size is 32,768 tokens, which is approximately 25,000 words. If your prefix content is shorter than this, it cannot be cached and must be sent as standard input with every request. For comparison, Claude's minimum cache size is only 1,024 tokens for Sonnet and Haiku models.
How does Gemini caching compare to Claude prompt caching?
The two systems use fundamentally different pricing models. Claude charges a 25% premium to write to the cache but offers a 90% discount on reads. Gemini charges nothing extra for writes but only discounts reads by 25%, plus an hourly storage fee. For frequent reads in short bursts, Claude's caching delivers much higher savings. Gemini's approach works better for long-lived caches with moderate read frequency.
When should I NOT use context caching?
Avoid caching when your prefix content changes frequently (each change invalidates the cache), when you make fewer than 1-2 requests per hour with the same prefix (storage costs eat into savings), when your prefix is under 32,768 tokens (below the minimum), or when your application uses highly dynamic prompts where no content is shared between requests. In these cases, standard input pricing is simpler and potentially cheaper.