[Image: Token cost visualization showing context window optimization tradeoffs for LLM applications]
Engineering

Context Window Economics: The Math Behind LLM Token Optimization

By Agents Squads · 7 min

TL;DR — Token minimization is the wrong optimization target. Injecting ~870 tokens of upfront context saves 2-3 tool calls and 500+ thinking tokens per session. Optimize for value per token, not minimum tokens.

The Hidden Cost Model

Every token in a context window has a cost, and not just a monetary one: longer contexts also cost compute and latency. We cover the broader principles in our guide to context optimization. At $3-15 per million tokens (depending on the model and whether the tokens are input or output), the naive approach is to minimize tokens.

But token minimization is the wrong optimization target — a lesson that becomes even more critical in multi-agent systems where token waste compounds across every agent.

The real question: when does injecting information upfront save more than it costs?

The Basic Math

Consider two approaches to providing an AI agent with project state. Injecting it upfront costs roughly 870 tokens and is instant. Letting the agent discover it via tool calls costs about 920 tokens plus latency. Token cost is roughly equivalent. So why does injection often win?

The hidden factor is relevance rate — how often the injected information actually gets used.

The value of injecting context equals the tokens saved, multiplied by how often the information actually gets used, minus the tokens spent injecting it. For a status command, that is 920 discovery tokens avoided, times an 80% session usage rate, minus 870 tokens injected: about -134 tokens, slightly negative on pure token math. But this ignores latency.

Each tool call adds an API roundtrip — typically 200-500ms. The agent also spends “thinking tokens” deciding whether to check state. When you account for these factors, upfront injection often wins despite the token cost.

The Numbers — Injecting state upfront: ~870 tokens, instant. Agent self-discovery via tool calls: ~920 tokens + 200-500ms latency per roundtrip + thinking tokens. Near-identical token cost, but injection wins on total efficiency.
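The arithmetic above can be sketched directly. This is a minimal model using the article's numbers; the function name and the decision to count thinking tokens once per session (rather than scaled by relevance) are assumptions for illustration.

```python
# Net value of injecting context upfront vs. letting the agent discover it.
# Positive result means injection wins on total tokens.

def injection_value(tokens_saved, relevance_rate, injection_cost,
                    thinking_tokens_saved=0):
    # Discovery tokens are only saved when the info is actually used;
    # the thinking overhead of deciding whether to query is paid every session.
    return tokens_saved * relevance_rate + thinking_tokens_saved - injection_cost

# Pure token math: 920 discovery tokens avoided 80% of the time vs. 870 injected.
pure = injection_value(920, 0.8, 870)
print(pure)           # ≈ -134: slightly negative, as the article notes

# Add the ~500 thinking tokens the agent spends deciding whether to check state:
with_thinking = injection_value(920, 0.8, 870, thinking_tokens_saved=500)
print(with_thinking)  # ≈ +366: injection wins once thinking tokens count
```

The sign flip is the whole argument: raw token counts are nearly a wash, so the decision hinges on the overhead terms.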

When High-Density Injection Wins

Inject upfront when:

- The information is referenced in most or all sessions (high relevance rate)
- The payload is small relative to the context window
- Discovery would otherwise cost tool calls, roundtrips, and thinking tokens

Real example: A session status summary (squad states, recent activity, active goals) gets referenced in nearly every interaction. The 870 tokens of upfront context saves 2-3 tool calls and 500+ thinking tokens per session.

When It Loses

Avoid upfront injection when:

- The information is large (thousands of tokens) but rarely referenced
- Most sessions need only a small slice that is easy to query on demand
- Relevance probability is low or hard to predict

Real example: Full project history (10,000+ tokens) when most sessions only need recent commits. Better to query on demand.

Practical Measurements

We measured actual token costs for common context injections:

| Context Type | Chars | Tokens | Use Case |
| --- | --- | --- | --- |
| Minimal status | ~800 | ~200 | Session hooks (always) |
| Full status | 3,493 | ~870 | Most sessions |
| Full dashboard | 9,367 | ~2,340 | Deep analysis |
| Project CLAUDE.md | 8,000 | ~2,000 | Always relevant |
| Full codebase index | 40,000+ | ~10,000 | Rarely needed upfront |

At session start, ~870 tokens of context represents less than 1% of a 200K token window. That’s cheap insurance against discovery overhead.

Our Data — A session status summary (squad states, recent activity, active goals) at ~870 tokens gets referenced in nearly every interaction. That’s less than 1% of a 200K token window — cheap insurance against discovery overhead.
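The table's chars-to-tokens ratios cluster around 4:1 (3,493/870 ≈ 4.0; 9,367/2,340 ≈ 4.0), the common rule of thumb for English text. A quick estimator, sketched here as an assumption rather than a tokenizer-accurate count:

```python
# Rough token estimate from character count, using the ~4 chars/token
# ratio visible in the measurements table. Real tokenizers will differ.

def estimate_tokens(text_chars: int, chars_per_token: float = 4.0) -> int:
    return round(text_chars / chars_per_token)

def window_fraction(tokens: int, window: int = 200_000) -> float:
    return tokens / window

print(estimate_tokens(3493))           # 873, close to the measured ~870
print(f"{window_fraction(870):.2%}")   # well under 1% of a 200K window
```

Useful for budgeting injections before you have exact tokenizer counts.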

Progressive Density Strategy

The optimal approach isn’t “inject everything” or “inject nothing” — it’s progressive density based on relevance probability.

At the lightest level, always inject squad names, activity flags, and critical state — about 200 tokens with 100% relevance. For most sessions, bump that up to full status, recent goals, and active work at around 870 tokens. Reserve the full load — complete history, all memory, deep context at 2,340+ tokens — for specific deep-analysis tasks.

The key insight: don’t optimize globally — optimize per session type.
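The three tiers can be expressed as a simple lookup. Tier names and token budgets follow the article; the session-type mapping is an illustrative assumption, not a prescribed taxonomy.

```python
# Progressive density: pick a context tier per session type rather than
# one global setting. Budgets and contents come from the article's tiers.

CONTEXT_TIERS = {
    "minimal":  {"tokens": 200,  "contents": ["squad names", "activity flags", "critical state"]},
    "standard": {"tokens": 870,  "contents": ["full status", "recent goals", "active work"]},
    "deep":     {"tokens": 2340, "contents": ["complete history", "all memory", "deep context"]},
}

def pick_tier(session_type: str) -> str:
    # Illustrative mapping: only deep-analysis tasks justify the full load.
    if session_type == "deep_analysis":
        return "deep"
    if session_type in ("interactive", "coding"):
        return "standard"
    return "minimal"  # hooks, background automation, unknown sessions

print(pick_tier("interactive"))  # standard
```

The point is that the branching lives in cheap dispatch code, not in the agent's expensive thinking tokens.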

The Full Picture

The real calculation accounts for more than raw tokens. You’re also saving latency (each avoided API roundtrip is 200-500ms), thinking tokens (the overhead the agent spends deciding whether to query), and the compound effect across sessions.

For interactive sessions, latency dominates — a 500ms roundtrip feels slow and disrupts flow. For background automation running overnight, token cost dominates instead. The optimization target shifts depending on how the agent is being used.
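One way to see the shifting target is to price latency explicitly. The token price and roundtrip time come from the article; the dollars-per-second-of-waiting figure is an illustrative assumption, not a measured value.

```python
# Total-cost sketch: token cost plus a latency term whose weight depends
# on how the agent is used. Interactive use prices waiting time; batch
# use sets that weight to zero.

def total_cost_usd(tokens, roundtrips, price_per_mtok=3.0,
                   roundtrip_ms=350, latency_cost_per_sec=0.0):
    token_cost = tokens / 1_000_000 * price_per_mtok
    latency_cost = roundtrips * (roundtrip_ms / 1000) * latency_cost_per_sec
    return token_cost + latency_cost

# Background automation: only tokens matter, and 920 tokens is fractions of a cent.
print(total_cost_usd(920, roundtrips=3))

# Interactive session: with an assumed $0.01 per second of user waiting,
# three discovery roundtrips dominate the token cost several times over.
print(total_cost_usd(920, roundtrips=3, latency_cost_per_sec=0.01))
```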

Applying This to Agent Design

Prompt Engineering

Structure prompts with relevance-aware sections:

## Context (Always Relevant)
{minimal_state}

## Extended Context (If Needed)
{full_state if complex_task else "Use tools to query"}

## Task-Specific Context
{injected only for matching task types}

Tool Descriptions

High-density descriptions for frequently-used tools pay off:

{
  "name": "search_codebase",
  "description": "Semantic search across all source files. Returns top 10 matches with surrounding context. Use for: finding implementations, understanding patterns, locating related code. Prefer over file reads when location unknown."
}

The longer description (~50 tokens) pays for itself by saving the thinking tokens the agent would otherwise spend deciding which tool to use.

Memory Loading

Load memory progressively:

Session start: Active goals, recent decisions (500 tokens)
On research task: Full topic memory (2,000 tokens)
On complex analysis: Everything relevant (5,000+ tokens)

Don’t load the full knowledge base for a simple commit message.
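The progression above can be sketched as layered loading, where each layer has a trigger condition. The budgets follow the article's numbers; the layer names and task-type keys are illustrative assumptions.

```python
# Progressive memory loading: start with the always-relevant slice and
# add layers only when the task type calls for them.

MEMORY_LAYERS = [
    # (layer name, token budget, task types that trigger it)
    ("session_start", 500,  {"always"}),            # active goals, recent decisions
    ("topic_memory",  2000, {"research"}),          # full topic memory
    ("full_relevant", 5000, {"complex_analysis"}),  # everything relevant
]

def memory_budget(task_type: str) -> int:
    budget = 0
    for _name, tokens, triggers in MEMORY_LAYERS:
        if "always" in triggers or task_type in triggers:
            budget += tokens
    return budget

print(memory_budget("commit_message"))   # 500: no full knowledge base needed
print(memory_budget("research"))         # 2500: base layer plus topic memory
```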

Key Takeaway — Don’t optimize globally — optimize per session type. For interactive sessions, latency dominates. For background automation, token cost dominates. The optimization target shifts depending on how the agent is used.

The Trap of Token Minimization

Teams optimizing for minimum tokens often create slower, more expensive agents. An agent that doesn’t have context spends tokens deciding what context it needs, burns latency calling tools to discover it, may miss relevant information due to incomplete discovery, and repeats the whole process across sessions.

An agent with appropriate upfront context starts working immediately, references what it needs without tool calls, completes tasks faster with fewer total tokens overall, and maintains coherence across interactions.

The goal isn’t minimum tokens — it’s maximum value per token.

How to Measure This

Track which injected context actually gets referenced — that’s your real usage rate. Count the tool calls your agents make just to learn about their environment. Compare task completion times across different context levels. Include thinking tokens, discovery tokens, and injected tokens in your total cost calculations.
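The tracking loop above can be prototyped with a few counters. This is a sketch under stated assumptions: the class name is invented, and the substring check stands in for whatever attribution method (log analysis, structured citations) you actually use.

```python
# Measure whether injected context earns its keep: tag each injected
# section with key phrases, then count how often session transcripts
# actually reference them.

from collections import Counter

class ContextTracker:
    def __init__(self):
        self.injected = {}           # section name -> key phrases
        self.references = Counter()  # section name -> sessions that used it
        self.sessions = 0

    def inject(self, name, key_phrases):
        self.injected[name] = key_phrases

    def observe_session(self, transcript: str):
        self.sessions += 1
        for name, phrases in self.injected.items():
            if any(p in transcript for p in phrases):
                self.references[name] += 1

    def usage_rate(self, name):
        # This is the relevance rate from the value formula.
        return self.references[name] / self.sessions if self.sessions else 0.0

tracker = ContextTracker()
tracker.inject("squad_status", ["squad", "active goal"])
tracker.observe_session("agent checked the active goal before committing")
tracker.observe_session("simple typo fix, no status needed")
print(tracker.usage_rate("squad_status"))  # 0.5
```

Sections whose measured rate stays low are candidates for demotion to on-demand queries.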

The numbers will tell you where your agents are wasting effort and where a little upfront context goes a long way. Optimize for relevance first, then density.


Note: Token estimates based on Claude tokenization. GPT and other models may vary by 10-20%. The principles apply regardless of specific counts.
