TL;DR — Minimizing tokens is the wrong goal. The right metric is value per token — and when you include human rework costs, “cheap” agents often cost 4-5x more than context-rich ones. Model arbitrage, prompt caching, and outcome-based measurement turn AI from a cost center into a value center.
Why Cheaper Agents Cost More
The instinct is natural: tokens cost money, so minimize tokens. This thinking is wrong, and it took us months of running AI agents in production to understand why. Token minimization optimizes for the wrong metric. The right question isn’t “how few tokens can we use?” It’s “how much value can we create per token?” That reframing changes everything about how you architect, budget, and scale AI operations.
Consider two agents solving the same problem. Agent A is token-minimized: it uses 10,000 tokens at $0.15, achieves a 60% success rate, and requires rework on 40% of tasks. Agent B is context-rich: it uses 25,000 tokens at $0.375, achieves a 95% success rate, and requires rework on just 5% of tasks.
On a per-token basis, Agent A looks cheaper. But factor in rework (count the tokens burned on re-run attempts, then divide total spend by successful outcomes) and the picture shifts: Agent A’s effective cost per successful outcome is about $0.35, and Agent B’s lands around $0.42. Still close. But here’s where the real economics hit: human time.
Every time Agent A fails, someone spends about six minutes reviewing and correcting the output. At $100/hour engineering cost, that’s $10 per rework event. Suddenly Agent A’s true cost jumps to $4.35 per outcome ($0.35 plus 0.4 × $10), while Agent B stays at $0.92. The “cheap” agent costs 4.7x more.
We see this pattern constantly. Token cost is noise. Outcome cost is signal. The teams that obsess over reducing API bills while ignoring rework cycles and human intervention are optimizing a rounding error.
Our Data — A token-minimized agent at 60% success costs $4.35 per outcome (including human rework at $100/hr). A context-rich agent at 95% success costs $0.92. The “cheap” agent is 4.7x more expensive.
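The arithmetic above fits in a few lines. This is a sketch using the article's own figures (the effective per-outcome token costs of $0.35 and $0.42, and $10 of human time per rework event), not universal constants:

```python
def true_cost_per_outcome(token_cost_per_success: float,
                          rework_rate: float,
                          human_cost_per_rework: float = 10.0) -> float:
    """Token cost per successful outcome plus expected human rework cost."""
    return token_cost_per_success + rework_rate * human_cost_per_rework

agent_a = true_cost_per_outcome(0.35, 0.40)   # token-minimized agent
agent_b = true_cost_per_outcome(0.42, 0.05)   # context-rich agent
print(f"A: ${agent_a:.2f}, B: ${agent_b:.2f}, ratio: {agent_a / agent_b:.1f}x")
# → A: $4.35, B: $0.92, ratio: 4.7x
```

Plug in your own rework rate and loaded labor cost; the ranking between agents rarely changes, but the gap does.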
Measuring Value Per Token
The metric that actually matters is straightforward: business outcome value divided by tokens consumed. Once you start tracking this, the economics become obvious.
Take a customer service agent that resolves tickets autonomously. A resolved ticket is worth roughly $50 in avoided human agent cost. The agent consumes about 15,000 tokens per resolution, costing $0.15 at $10 per million tokens. That works out to $0.0033 of value generated per token consumed — a 333x return on every token spent.
When value per token is this high, the optimization target flips entirely. You don’t want fewer tokens. You want more throughput. You want that agent handling as many tickets as possible, because every token it consumes creates outsized value. The same logic applies to development agents, research agents, and operations agents — once you’ve validated that an agent produces reliable value, the constraint isn’t cost, it’s how much work you can route through it.
This is the mindset shift from cost center to value center. We stopped asking “how do we spend less on AI?” and started asking “how do we route more work through AI?”
The Numbers — A customer service agent resolving $50 tickets at $0.15 per resolution generates $0.0033 of value per token consumed — a 333x return. At that ratio, you want more throughput, not fewer tokens.
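The value-per-token calculation in code, using the article's figures ($50 per resolved ticket, 15,000 tokens per resolution, $10 per million tokens as a blended price assumption):

```python
ticket_value = 50.0            # avoided human-agent cost per resolved ticket
tokens_per_resolution = 15_000
price_per_million = 10.0       # blended token price assumed in the article

token_cost = tokens_per_resolution / 1_000_000 * price_per_million
value_per_token = ticket_value / tokens_per_resolution
roi = ticket_value / token_cost

print(f"cost: ${token_cost:.2f}, value/token: ${value_per_token:.4f}, ROI: {roi:.0f}x")
# → cost: $0.15, value/token: $0.0033, ROI: 333x
```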
The Model Arbitrage Strategy
Not every token needs to come from the most expensive model. The cost spread across model tiers is enormous — Claude Opus 4.1 runs $15 per million input tokens and $75 per million output tokens, while Claude 3 Haiku costs just $0.25 and $1.25 respectively. Claude Sonnet 4 sits in the middle at $3 and $15. That’s a 60x cost difference between the cheapest and most expensive option.
The smart play is model arbitrage: use expensive models for high-value decisions and cheap models for everything else. This isn’t about cutting corners. It’s about matching model capability to task complexity, so you’re not paying for reasoning power you don’t need.
We run this in our code review pipeline. Haiku scans every file for obvious issues like syntax errors and formatting problems, costing about a tenth of a cent per file. Sonnet reviews the flagged files for logic and architecture issues at roughly two cents per file. Opus only touches the security-critical code at about ten cents per file.
For a 100-file review, this tiered approach costs $1.20 total — Haiku processes all 100 files for $0.10, Sonnet reviews the 30 that got flagged for $0.60, and Opus deep-dives the 5 security-sensitive ones for $0.50. Running Opus on all 100 files would cost $10.00. Same quality where it matters, 8x less spend. The key insight is that most work is routine. You only need the expensive model for the 5% that genuinely requires deep reasoning.
Our Data — Our tiered code review pipeline: Haiku ($0.10) + Sonnet ($0.60) + Opus ($0.50) = $1.20 for 100 files. Running Opus on everything: $10.00. Same quality where it matters, 8x cheaper.
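The tiering math is simple enough to sanity-check in code. The per-file costs and escalation counts below are the article's rough figures, not measured constants:

```python
# Approximate per-file review cost at each tier (article's rough figures).
TIER_COST = {"haiku": 0.001, "sonnet": 0.02, "opus": 0.10}

def tiered_review_cost(total_files: int, flagged: int, security_critical: int) -> float:
    # Every file gets the cheap scan; only escalated files hit pricier tiers.
    return (total_files * TIER_COST["haiku"]
            + flagged * TIER_COST["sonnet"]
            + security_critical * TIER_COST["opus"])

tiered = tiered_review_cost(100, flagged=30, security_critical=5)
opus_only = 100 * TIER_COST["opus"]
print(f"tiered: ${tiered:.2f} vs opus-only: ${opus_only:.2f}")
# → tiered: $1.20 vs opus-only: $10.00
```

The escalation fractions (30% flagged, 5% security-critical) dominate the total: the cheaper your first-pass filter can safely be, the better the arbitrage.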
Batch Processing and Prompt Caching
How you consume tokens matters as much as which model you choose. The economics of batch versus interactive processing are dramatically different.
Interactive use cases — customer-facing chat, real-time copilots — carry a latency premium. Someone is waiting, so you pay full price for speed. But a surprising amount of AI work doesn’t need real-time response. Nightly report generation, code analysis, data processing, content moderation queues — these can all run in batch mode with aggressive prompt caching.
Anthropic’s prompt caching reduces input token costs by 90% for repeated context: cache reads bill at 10% of the base input rate, while cache writes carry a one-time 25% premium. For batch operations with consistent system prompts, this is transformative. A nightly batch job running a million tokens of analysis drops from $15 to $1.50 per run. Over a month, that’s $405 in savings on a single job. We cache everything we can — system prompts, reference documentation, style guides, few-shot examples — and it consistently cuts our effective token rate by 5-8x on batch workloads.
The compounding effect is significant. Combine model arbitrage with prompt caching on batch workloads, and you can often achieve 20-40x cost reduction compared to naively running everything through a frontier model in real-time. That’s the difference between AI operations that scale and ones that hit a budget wall at the first sign of growth.
Key Takeaway — Prompt caching cuts input costs by 90%. Combined with model arbitrage and batch processing, you can achieve 20-40x cost reduction vs. running a frontier model in real-time.
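The caching arithmetic, as a sketch. It ignores the one-time 25% cache-write premium, which amortizes to near zero across a month of nightly runs:

```python
PRICE_PER_MILLION = 15.0    # frontier-tier input price assumed in the article
CACHE_READ_RATE = 0.10      # cached input tokens bill at 10% of base price

def nightly_job_cost(input_tokens: int, cached_fraction: float) -> float:
    fresh = input_tokens * (1 - cached_fraction)
    cached = input_tokens * cached_fraction
    return (fresh + cached * CACHE_READ_RATE) / 1_000_000 * PRICE_PER_MILLION

uncached = nightly_job_cost(1_000_000, cached_fraction=0.0)   # $15.00 per run
cached = nightly_job_cost(1_000_000, cached_fraction=1.0)     # $1.50 per run
print(f"monthly savings: ${(uncached - cached) * 30:.0f}")
# → monthly savings: $405
```

In practice `cached_fraction` is below 1.0 because each run carries some fresh input; the savings scale linearly with how much context you can keep stable.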
Budgeting AI Like Infrastructure
We structure AI spending into three categories, and tracking them separately has been essential for understanding where the money actually goes.
Fixed costs are the predictable baseline: monitoring agents on scheduled runs, batch processing for daily and weekly reports, integration maintenance. We budget these like server costs. They should decrease as a percentage of value over time as the agents mature and handle more work without growing proportionally in token consumption.
Variable costs scale with business activity: customer-facing agent interactions, developer assistance per engineer, research tasks per investigation. These should increase as the business grows, because more interactions mean more cost but also more value delivered. If variable costs are rising but outcomes are flat, something is broken — and that signal has saved us from pouring money into underperforming agents more than once.
Investment costs are capability building: generating training data, developing new agents, experimenting with fine-tuning. We treat these as R&D with payback measured in quarters, not days. The temptation is to cut investment costs when budgets get tight, but that’s how you end up running the same mediocre agents a year from now.
Watching for Cost Anomalies
Running agents in production, we’ve learned which cost patterns signal real problems versus normal variation.
A sudden 10x spike almost always means an agent is stuck in a loop — kill it and investigate. Gradual daily increases usually indicate context bloat, where prompts accumulate unnecessary history and need pruning. High variation between runs for the same task type points to inconsistent inputs that need standardization. And the most insidious pattern: costs rising while outcomes stay flat, which means the model or prompt architecture needs re-evaluation.
Knowing when not to optimize is equally important. If outcomes are excellent, cost is within budget, and the system is stable, don’t touch it. We’ve wasted engineering hours chasing marginal token savings on agents that were already delivering exceptional value. The general triggers we use: optimize when cost per outcome has been rising for three or more months, when token efficiency is less than half of comparable tasks, or when human intervention rate exceeds 20%. Outside of those conditions, leave it alone. The optimization urge is real, but it’s often a distraction from building new capabilities that generate more value.
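Those triggers reduce to a simple guard. The parameter names here are ours; the threshold values are the article's:

```python
def should_optimize(months_cost_rising: int,
                    token_efficiency_vs_peers: float,
                    human_intervention_rate: float) -> bool:
    """Fires on any of three triggers: cost per outcome rising for 3+
    months, token efficiency below half of comparable tasks, or a
    human intervention rate above 20%."""
    return (months_cost_rising >= 3
            or token_efficiency_vs_peers < 0.5
            or human_intervention_rate > 0.20)

# A stable, high-value agent: leave it alone.
print(should_optimize(0, 0.9, 0.05))  # → False
```

Encoding the triggers as an explicit check makes "don't touch it" a deliberate decision rather than neglect.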
Multi-Agent Cost Dynamics
For multi-agent systems, costs compound in ways that aren’t immediately obvious. Beyond the raw token consumption of each agent, you pay for coordination overhead: every handoff between agents costs 500 to 2,000 tokens for context transfer, every routing decision costs 200 to 500 tokens, and every result summarization adds another 300 to 1,000 tokens.
In a five-agent workflow with four handoffs, coordination overhead alone runs 4,000 to 10,000 tokens. This has a direct design implication: fewer, more capable agents often beat many specialized agents on cost efficiency.
That said, multi-agent systems justify their overhead in specific scenarios. For complex research requiring multiple perspectives, the quality improvement exceeds the coordination cost. For time-critical work, parallel agent execution beats sequential processing. For high-stakes decisions, having multiple agents cross-check each other is worth every coordination token. But for simple, well-defined tasks, a single capable agent almost always wins on both cost and latency.
The break-even point is clear: multi-agent systems earn their keep when the quality or speed improvement exceeds the coordination overhead. We track this ratio explicitly and consolidate agents when it drops below 1.5x.
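Coordination overhead can be budgeted per handoff. The per-event token ranges below are the article's; the function computes the theoretical envelope assuming every handoff also triggers one routing decision and one summarization, so real workflows (like the article's observed 4,000-10,000 tokens) tend to land below the worst case:

```python
# Per-event token cost ranges from the article.
HANDOFF = (500, 2_000)
ROUTING = (200, 500)
SUMMARY = (300, 1_000)

def overhead_range(handoffs: int) -> tuple[int, int]:
    """Min/max coordination tokens if every handoff also incurs one
    routing decision and one result summarization."""
    low = handoffs * (HANDOFF[0] + ROUTING[0] + SUMMARY[0])
    high = handoffs * (HANDOFF[1] + ROUTING[1] + SUMMARY[1])
    return low, high

print(overhead_range(4))  # → (4000, 14000)
```

Because overhead scales with handoff count, collapsing two chatty agents into one often saves more than any prompt tuning within either agent.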
Pricing AI Products for Value
If you’re building products powered by AI, token economics directly shape your pricing strategy, and getting this wrong can kill your margins or your growth.
Cost-plus pricing — marking up your token costs by some multiplier and adding fixed costs — is the simplest approach but the weakest. It gives you a predictable percentage margin, but it doesn’t capture the value you’re delivering. Worse, it ties your revenue to your cost base: when Anthropic or OpenAI cut prices, the costs you were marking up shrink, and your revenue per customer shrinks with them. We’ve seen startups build entire revenue models on cost-plus and watch exactly that happen.
Value-based pricing aligns your price with customer outcomes. If your agent saves a customer $500 per month in labor costs, charging $100 per month captures value rather than marking up costs. This enables much higher margins, but it requires you to actually measure the value delivered.
Outcome-based pricing is the boldest approach: the customer pays per result. Per resolved ticket, per generated report, per completed analysis. You bear the efficiency risk, but you also send the strongest trust signal. This is how Sierra built to $100 million in ARR — charging per resolved customer service ticket. They capture value, not cost. When your agents are good enough, outcome-based pricing creates a flywheel: better agents mean lower cost per outcome, which means higher margins at the same price.
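The flywheel is easy to see in numbers. With a hypothetical $2-per-resolved-ticket price (our illustration, not a figure from the article), every drop in cost per outcome flows straight to margin:

```python
def outcome_margin(price_per_outcome: float, cost_per_outcome: float) -> float:
    """Gross margin fraction under outcome-based pricing."""
    return (price_per_outcome - cost_per_outcome) / price_per_outcome

price = 2.00  # hypothetical per-ticket price
for cost in (0.50, 0.20, 0.05):  # agent cost per outcome improving over time
    print(f"cost ${cost:.2f} -> margin {outcome_margin(price, cost):.1%}")
# cost $0.50 -> margin 75.0%
# cost $0.20 -> margin 90.0%
# cost $0.05 -> margin 97.5%
```

Under cost-plus, those efficiency gains would have flowed to the customer instead; under outcome-based pricing, they accrue to you.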
The Bottom Line
We track five things that tell us whether our token economics are healthy: cost per successful outcome (should decrease over time), value per token (should increase), model efficiency across tiers (ensures our arbitrage strategy is working), retry rate (target under 10%), and human escalation rate (target under 5%).
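As a sketch, those five signals collapse into one health check. The function structure is ours; the two numeric targets are the article's:

```python
def economics_healthy(cost_per_outcome_falling: bool,
                      value_per_token_rising: bool,
                      arbitrage_working: bool,
                      retry_rate: float,
                      escalation_rate: float) -> bool:
    """All five health signals must hold at once."""
    return (cost_per_outcome_falling
            and value_per_token_rising
            and arbitrage_working
            and retry_rate < 0.10        # target: under 10%
            and escalation_rate < 0.05)  # target: under 5%

print(economics_healthy(True, True, True, retry_rate=0.07, escalation_rate=0.03))
# → True
```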
The principles behind all of this are simple. Optimize for value, not cost. Use expensive models only for high-value decisions. Batch and cache aggressively. Measure outcomes, not token counts. And always, always include human time in your cost calculations — token costs are often less than 5% of the true cost when you factor in human intervention.
The goal isn’t to minimize tokens. It’s to maximize value per token.
Related: Context Window Economics covers when to inject context vs discover via tools. Profitable AI analyzes companies successfully monetizing AI products.