Gemini Context Caching: Pricing and Savings Guide
Context caching lets you store large input prefixes and reuse them across multiple API requests, reducing the cost of repeated content. This guide explains exactly how the pricing works, when caching saves money, and how it compares to alternatives like Claude's prompt caching.
What Is Context Caching?
When you use the Gemini API, every request includes input tokens that the model processes before generating a response. If you are sending the same large block of text (a system prompt, a document, a codebase) across multiple requests, you are paying full input price for that repeated content every single time.
Context caching solves this by letting you upload a large prefix once, store it on Google's servers, and reference it in subsequent requests. The cached tokens are read from storage instead of being reprocessed, which costs less than sending them as fresh input.
Common use cases include: chatbots with long system prompts, applications that analyze the same document repeatedly, code assistants working on a fixed codebase, and any workflow where the beginning of each request is identical.
How It Works in Practice
1. Upload your prefix content (minimum 32,768 tokens) to create a cache entry
2. Receive a cache ID that references your stored content
3. Include the cache ID in subsequent requests instead of the full prefix
4. Pay the reduced input price for cached tokens, plus a storage fee per hour
Context Caching Pricing
Gemini context caching has two cost components:
Storage Cost
25% of the standard input price per hour. This is charged for every hour the cache exists, regardless of whether you read from it. Think of it as a rental fee for keeping your content readily available.
Read Discount
25% off the standard input price per read. When you reference cached tokens in a request, those tokens are charged at 75% of the normal input rate. The non-cached portion of your request is still billed at full price.
Standard vs Cached Pricing Per Model
| Model | Standard Input | Cached Read | Storage / Hour | Output |
|---|---|---|---|---|
| Gemini 2.5 Pro | $1.25 | $0.9375 | $0.3125 | $10.00 |
| Gemini 2.5 Flash | $0.15 | $0.1125 | $0.0375 | $0.60 |
| Gemini 2.0 Flash | $0.10 | $0.075 | $0.025 | $0.40 |
All prices per 1M tokens. Output pricing is unaffected by caching.
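Every cached and storage price in the table derives from the standard input price, which a small helper makes explicit. This is a sketch of the pricing rules above, not an API call; the function name is our own.

```python
def caching_costs(standard_input_per_mtok: float) -> dict:
    """Derive cached-read and storage prices from the standard input price.

    Cached reads are billed at 75% of the input price (a 25% discount);
    storage costs 25% of the input price per MTok per hour.
    """
    return {
        "cached_read_per_mtok": standard_input_per_mtok * 0.75,
        "storage_per_mtok_hour": standard_input_per_mtok * 0.25,
    }

# Gemini 2.5 Flash: $0.15/MTok standard input
flash = caching_costs(0.15)
print(round(flash["cached_read_per_mtok"], 4))   # cached read price
print(round(flash["storage_per_mtok_hour"], 4))  # hourly storage price
```

Running this for $0.15 reproduces the Flash row above ($0.1125 cached read, $0.0375/hour storage); the same function applied to $1.25 reproduces the 2.5 Pro row.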
Break-Even Analysis: When Does Caching Save Money?
Context caching is not always cheaper. The storage fee means you need to make enough cached reads within each hour to offset the rental cost. Let's work through the math.
The Formula
Savings per read = standard_input - cached_read = 25% of input price
Storage cost per hour = 25% of input price per MTok
Break-even reads per hour = storage_cost / savings_per_read = 1 read per hour
The break-even point is surprisingly simple: you need just 1 cached read per hour to cover the storage fee. At exactly 1 read per hour, the savings from the discounted read equal the storage cost, so you break even. Every additional read within that hour is pure savings.
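The cancellation is easy to verify in code: because the per-read savings and the hourly storage fee are both 25% of the input price, the price and cache size drop out of the ratio entirely. A minimal sketch:

```python
def break_even_reads_per_hour(input_price_per_mtok: float, cache_mtok: float) -> float:
    """Reads per hour needed for caching savings to cover storage cost."""
    savings_per_read = 0.25 * input_price_per_mtok * cache_mtok
    storage_per_hour = 0.25 * input_price_per_mtok * cache_mtok
    return storage_per_hour / savings_per_read

# The price and cache size cancel: always exactly 1 read per hour.
print(break_even_reads_per_hour(1.25, 0.5))  # 1.0
print(break_even_reads_per_hour(0.15, 0.1))  # 1.0
```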
Real Savings Examples
Example 1: Customer Support Bot with 100K Token Knowledge Base
Using Gemini 2.5 Flash, a support bot sends the same 100K token knowledge base with every request. It handles 60 requests per hour.
Without caching: 60 reads x 0.1 MTok x $0.15 = $0.90/hour
With caching:
Cached reads: 60 x 0.1 MTok x $0.1125 = $0.675/hour
Storage: 0.1 MTok x $0.0375/hour = $0.00375/hour
Total: $0.679/hour
Savings: $0.221/hour ≈ $5.30/day ≈ $159/month, assuming round-the-clock traffic (a 24.6% reduction)
Example 2: Code Review Tool with 500K Token Codebase
Using Gemini 2.5 Pro, a code review tool analyzes PRs against a 500K token codebase. Developers make 20 review requests per hour during work hours.
Without caching: 20 reads x 0.5 MTok x $1.25 = $12.50/hour
With caching:
Cached reads: 20 x 0.5 MTok x $0.9375 = $9.375/hour
Storage: 0.5 MTok x $0.3125/hour = $0.15625/hour
Total: $9.531/hour
Savings: $2.969/hour = $23.75/day (8 work hours) ≈ $475/month (20 workdays)
Example 3: Low-Volume Research Tool (Where Caching Barely Helps)
Using Gemini 2.5 Pro, a research tool sends a 200K token document with occasional queries, about 2 per hour.
Without caching: 2 reads x 0.2 MTok x $1.25 = $0.50/hour
With caching:
Cached reads: 2 x 0.2 MTok x $0.9375 = $0.375/hour
Storage: 0.2 MTok x $0.3125/hour = $0.0625/hour
Total: $0.4375/hour
Savings: $0.0625/hour = $1.50/day (only 12.5% reduction)
At low request volumes, the storage cost eats into savings. Still cheaper, but the margin is thin.
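All three examples follow the same arithmetic, so a single helper can reproduce them. This is a sketch that assumes the entire prefix is cached; the prices come from the table above.

```python
def hourly_costs(reads_per_hour: int, cache_mtok: float, input_price: float) -> tuple:
    """Return (uncached, cached) hourly cost for a fully cached prefix."""
    uncached = reads_per_hour * cache_mtok * input_price
    cached_reads = reads_per_hour * cache_mtok * input_price * 0.75  # 25% read discount
    storage = cache_mtok * input_price * 0.25                        # 25%/hour storage fee
    return uncached, cached_reads + storage

# Example 1: 2.5 Flash, 100K-token knowledge base, 60 reads/hour
print(hourly_costs(60, 0.1, 0.15))   # matches the $0.90 vs ~$0.679 figures above
# Example 2: 2.5 Pro, 500K-token codebase, 20 reads/hour
print(hourly_costs(20, 0.5, 1.25))   # matches $12.50 vs ~$9.531
# Example 3: 2.5 Pro, 200K-token document, 2 reads/hour
print(hourly_costs(2, 0.2, 1.25))    # matches $0.50 vs $0.4375
```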
Minimum Cache Size and Constraints
Google requires a minimum of 32,768 tokens (roughly 25,000 words) to create a cache entry. Content shorter than this must be sent as regular input. This threshold ensures the caching infrastructure overhead is worthwhile relative to the content being stored.
Additional constraints to keep in mind:
- Cache entries have a configurable time-to-live (TTL). Default is 1 hour. You can set shorter or longer TTLs, but storage is billed for the full duration.
- The cached content must be a prefix. You cannot cache content that appears in the middle or end of your prompt. The cache always starts from the beginning of the input.
- Cache entries are model-specific. A cache created for Gemini 2.5 Pro cannot be used with Gemini 2.5 Flash or any other model.
- You can update the TTL of an existing cache without recreating it, which avoids reprocessing the content.
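Because storage is billed for the full TTL whether or not you ever read from the cache, the TTL you choose has a direct, predictable cost. A quick sketch, using the 2.5 Pro storage rate from the table (the function name is ours):

```python
def ttl_storage_cost(cache_mtok: float, storage_per_mtok_hour: float, ttl_hours: float) -> float:
    """Storage is billed for the whole TTL, regardless of reads."""
    return cache_mtok * storage_per_mtok_hour * ttl_hours

# A 500K-token cache on Gemini 2.5 Pro ($0.3125/MTok/hour), held for 8 hours:
print(ttl_storage_cost(0.5, 0.3125, 8))  # 1.25
```

At $1.25 for an 8-hour workday, a long TTL is cheap insurance against reprocessing a 500K-token prefix, but an idle cache with a multi-day TTL accrues the same rate with no offsetting read savings.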
When to Use Context Caching (and When Not To)
Good Candidates
- Chatbots with large, static system prompts
- Document Q&A where users query the same document many times
- Code assistants with a fixed repository context
- Any workflow with 2+ requests/hour sharing the same prefix
Poor Candidates
- Short system prompts (under 32K tokens)
- Prompts that change every request
- Low-frequency usage (less than 1 request per hour)
- One-off batch processing jobs
Gemini Context Caching vs Claude Prompt Caching
Anthropic's Claude offers a competing prompt caching feature with a fundamentally different pricing model. Understanding the differences helps you choose the right provider for your caching workload.
| Aspect | Gemini Caching | Claude Caching |
|---|---|---|
| Cache Write Cost | Standard input price (no surcharge) | 25% markup on input price |
| Cache Read Discount | 25% off input price | 90% off input price |
| Storage Cost | 25% of input price/hour | None (auto-eviction after 5 min idle) |
| Minimum Cache Size | 32,768 tokens | 1,024 tokens (Sonnet/Haiku) |
| Cache Duration | Configurable TTL (default 1 hour) | 5 minutes of idle time |
| Best For | Moderate reads, long-lived caches | Frequent reads in short bursts |
The Key Difference
Claude's 90% read discount is far more aggressive than Gemini's 25%. For workloads with many reads per hour, Claude's caching delivers dramatically higher savings. However, Claude's cache auto-evicts after 5 minutes of idle time, making it unsuitable for sporadic access patterns. Gemini's explicit TTL control and hourly storage model give you predictable costs and work better for less frequent, more spread-out usage patterns. Choose based on your actual access pattern.
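A rough hourly model makes the trade-off concrete. This is a sketch under simplifying assumptions, not a full cost model: the whole prefix is cached, Gemini's cache is written once and held via TTL (write cost amortized away), and Claude's reads arrive fast enough that the cache never idles out, so it is written once per hour. The discount and surcharge figures come from the comparison table; the function names are ours.

```python
def gemini_hourly(reads: int, cache_mtok: float, input_price: float) -> float:
    # Reads at 75% of the input price, plus the 25%/hour storage fee
    return reads * cache_mtok * input_price * 0.75 + cache_mtok * input_price * 0.25

def claude_hourly(reads: int, cache_mtok: float, input_price: float) -> float:
    # One cache write per hour at a 25% surcharge, then reads at 10% of input price
    write = cache_mtok * input_price * 1.25
    return write + reads * cache_mtok * input_price * 0.10

# Same hypothetical 0.5 MTok prefix at a $1.25/MTok input price:
for reads in (2, 20, 100):
    print(reads,
          round(gemini_hourly(reads, 0.5, 1.25), 3),
          round(claude_hourly(reads, 0.5, 1.25), 3))
```

Under these assumptions Claude pulls far ahead as read volume grows. But note the caveat baked into the model: at 2 reads per hour the reads would be roughly 30 minutes apart, Claude's cache would evict between them, and each read would require a fresh surcharged write, which is exactly the sporadic-access case where Gemini's TTL model wins.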
Frequently Asked Questions
How much does Gemini context caching save?
Context caching reduces the input token cost by 25% for cached tokens. For Gemini 2.5 Pro, input drops from $1.25 to $0.9375 per million tokens for cached reads. You also pay a storage fee of $0.3125 per MTok per hour (25% of the input price). Net savings depend on request frequency: at 10 reads per hour you save about 22.5% of your cached input costs, approaching the full 25% as frequency grows. At just 2 reads per hour, savings drop to 12.5%.
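Because both the read discount and the storage fee are 25% of the input price, the net savings fraction depends only on read frequency; the price and cache size cancel out. A minimal sketch:

```python
def savings_fraction(reads_per_hour: float) -> float:
    """Fraction of uncached input cost saved; independent of price and cache size.

    Derivation: savings = N * 0.25 * c * p - 0.25 * c * p, uncached cost = N * c * p,
    so the fraction is 0.25 * (1 - 1/N).
    """
    return 0.25 * (1 - 1 / reads_per_hour)

print(savings_fraction(2))    # 0.125  (the 12.5% low-volume case)
print(savings_fraction(60))   # approaches 0.25 at high volume
```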
What is the minimum cache size for Gemini?
The minimum cache size is 32,768 tokens, which is approximately 25,000 words. If your prefix content is shorter than this, it cannot be cached and must be sent as standard input with every request. For comparison, Claude's minimum cache size is only 1,024 tokens for Sonnet and Haiku models.
How does Gemini caching compare to Claude prompt caching?
The two systems use fundamentally different pricing models. Claude charges a 25% premium to write to the cache but offers a 90% discount on reads. Gemini charges nothing extra for writes but only discounts reads by 25%, plus an hourly storage fee. For frequent reads in short bursts, Claude's caching delivers much higher savings. Gemini's approach works better for long-lived caches with moderate read frequency.
When should I NOT use context caching?
Avoid caching when your prefix content changes frequently (each change invalidates the cache), when you make fewer than 1-2 requests per hour with the same prefix (storage costs eat into savings), when your prefix is under 32,768 tokens (below the minimum), or when your application uses highly dynamic prompts where no content is shared between requests. In these cases, standard input pricing is simpler and potentially cheaper.