TL;DR — Multi-agent systems burn 15x more tokens than single agents, and performance degrades well before the context window fills up. Keep agents in the 30-40% “Smart Zone,” use role-specific budgets, and treat handoffs as the highest-leverage optimization point.
The Context Problem Nobody Warns You About
Andrej Karpathy nailed it: the LLM context window is RAM, not a filing cabinet. You curate it carefully or you pay for it in degraded performance.
This insight becomes urgent once you move beyond a single agent. Anthropic’s research shows multi-agent architectures can consume 15x more tokens than single-agent approaches. Without deliberate context engineering, costs spiral, outputs degrade, and agents start hallucinating off each other’s noise. The goal is not to fill context windows; it’s to find the smallest possible set of high-signal tokens that maximizes the outcomes you care about.
We learned this the hard way running 45 agents across 8 squads in our squads architecture. Context problems don’t announce themselves politely. They show up as a research agent that slowly gets dumber over a long run, or an orchestrator that starts giving contradictory instructions to its sub-agents, or a synthesis task that quietly drops a key source because the model lost track of it in a sea of tokens.
The failure modes have names. Context rot is when model performance degrades as token count creeps up — we measured a 23% accuracy drop in one research agent operating above 60% context utilization. Context poisoning happens when a hallucination enters stored information and compounds over time as other agents consume it. Context distraction and context confusion are cousins: too much information overwhelms the model, and irrelevant content starts influencing responses in subtle ways. Context clash — conflicting information within the same window — rounds out the set. These aren’t theoretical. We’ve hit every single one in production.
Our Data — Running 45 agents across 8 squads, we measured a 23% accuracy drop in a research agent operating above 60% context utilization. Context problems don’t announce themselves — they show up as agents that slowly get dumber.
The 40% Smart Zone
Claude Code triggers auto-compact at 95% context saturation, but performance degrades long before that threshold. Based on our testing, the relationship between context utilization and output quality follows a surprisingly predictable curve.
Below 30% utilization, reasoning quality stays optimal — the model has room to think. Between 30% and 40% is what we call the Smart Zone: a good balance of loaded context and reasoning headroom. Push past 40% and you start seeing noticeable degradation on complex tasks. By 60-80%, quality loss is significant and hallucinations increase measurably. Above 80%, outputs become unreliable. And if you hit 95%, auto-compact fires and you lose context entirely — often the wrong context.
The practical recommendation is simple: design your agents to live in the Smart Zone. Build in checkpoints and summarization triggers well before hitting 40%. If an agent routinely operates above that line, it’s a design problem, not a runtime problem. Either the agent is trying to do too much, or it’s accumulating context it doesn’t need.
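The zone boundaries above can be expressed as a simple pre-step check. This is a minimal sketch: the function names and the 35% checkpoint margin are illustrative choices, not an established API, and the thresholds mirror the curve described in this section.

```python
# Illustrative Smart Zone check; thresholds follow the curve above.

def zone_for(used_tokens: int, window_tokens: int) -> str:
    """Map context utilization to one of the quality zones."""
    util = used_tokens / window_tokens
    if util < 0.30:
        return "optimal"       # full reasoning headroom
    if util < 0.40:
        return "smart"         # balanced: act, but plan a checkpoint
    if util < 0.60:
        return "degrading"     # noticeable quality loss on complex tasks
    if util < 0.80:
        return "risky"         # significant loss, measurable hallucinations
    if util < 0.95:
        return "unreliable"    # outputs can no longer be trusted
    return "auto_compact"      # Claude Code compacts at 95%

def should_checkpoint(used_tokens: int, window_tokens: int) -> bool:
    """Trigger summarization with margin, well before the 40% line."""
    return used_tokens / window_tokens >= 0.35
```

Running the check between agent steps (rather than only at the end of a task) is what turns the zone model from a diagnosis into a design constraint.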
Key Takeaway — Design your agents to live in the 30-40% Smart Zone. If an agent routinely operates above that line, it’s a design problem, not a runtime problem.
Four Techniques for Managing Context
Anthropic’s guidance organizes context management into four categories. We’ve found this framework genuinely useful — not as a checklist, but as a mental model for diagnosing and fixing context problems.
Externalizing context means persisting information outside the window and retrieving it when needed. This is closely related to building robust memory systems for AI agents. Scratchpads, long-term memory files, todo lists, notes files — all variations on the same idea. A research agent investigating 10 sources shouldn’t keep all findings in working memory. It should write each source analysis to a file and keep only a 2-3 sentence summary in context. That’s roughly 500 tokens instead of 15,000. The discipline is writing things down early and often, treating context like expensive short-term memory rather than cheap storage.
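The research-agent pattern above can be sketched as a small scratchpad: full analyses go to disk, only short summaries stay in context, and the agent recalls a full analysis only when it actually needs it. The `Scratchpad` class and its method names are hypothetical, not a real library.

```python
import json
from pathlib import Path

class Scratchpad:
    """Illustrative externalized memory: files hold the detail,
    context holds only short summaries."""

    def __init__(self, root: str = "scratchpad"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)
        self.summaries: list[str] = []  # the only part kept in context

    def externalize(self, source_id: str, full_analysis: str, summary: str) -> None:
        """Write the full analysis out; keep the 2-3 sentence summary."""
        (self.root / f"{source_id}.json").write_text(
            json.dumps({"source": source_id, "analysis": full_analysis})
        )
        self.summaries.append(f"[{source_id}] {summary}")

    def recall(self, source_id: str) -> str:
        """Retrieve a full analysis only when it is actually needed."""
        data = json.loads((self.root / f"{source_id}.json").read_text())
        return data["analysis"]

    def context_block(self) -> str:
        """Roughly 500 tokens of summaries instead of ~15,000 of findings."""
        return "\n".join(self.summaries)
```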
Selecting context is about fetching only what’s relevant at runtime. Load your CLAUDE.md files upfront because they’re always relevant. Use file search for discovery, then read only what you need. Don’t pre-load anything “just in case.” RAG on tool descriptions showed a 3x improvement in tool selection accuracy for agents with 20+ available tools — proof that giving the model less but more targeted information beats giving it everything.
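To make the tool-selection idea concrete, here is a dependency-free sketch. A production system would embed tool descriptions and rank by vector similarity; this version substitutes keyword overlap so it runs standalone. The tool set and function name are invented for illustration.

```python
# Illustrative tool selection: expose only the k most relevant tools
# instead of all of them. Real RAG would use embeddings, not word overlap.

TOOLS = {
    "web_search": "search the web for pages matching a query",
    "read_file": "read the contents of a local file by path",
    "write_file": "write text content to a local file",
    "run_sql": "execute a sql query against the analytics database",
    "send_email": "send an email message to a recipient",
}

def select_tools(task: str, k: int = 3) -> list[str]:
    """Rank tools by relevance to the task; return at most k names."""
    task_words = set(task.lower().split())
    scored = [
        (len(task_words & set(desc.lower().split())), name)
        for name, desc in TOOLS.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best overlap first
    return [name for score, name in scored[:k] if score > 0]
```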
Compressing context means condensing information while preserving critical decisions. Summarize after completing subtasks in 2-3 sentences. Drop tool outputs after extracting conclusions. Keep decisions and rationale, discard raw data. The trap here is over-aggressive compression: strip too much and you lose subtle but important nuance. Always test your compression strategy against representative tasks before deploying it broadly.
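The drop-raw-data discipline can be wired into the agent loop directly. This sketch replaces a tool output with its extracted conclusion and records the compression ratio so it can be audited; the function names and the 4-chars-per-token estimate are illustrative assumptions.

```python
# Illustrative tool-output compression: keep the conclusion, drop the
# raw payload, track the ratio so over- or under-compression is visible.

def rough_tokens(text: str) -> int:
    """Crude token estimate (~4 characters per token) for budgeting."""
    return max(1, len(text) // 4)

def compress_tool_output(raw_output: str, conclusion: str) -> dict:
    """Replace raw output with its conclusion; record what was dropped."""
    before, after = rough_tokens(raw_output), rough_tokens(conclusion)
    return {
        "kept": conclusion,                # decisions and rationale survive
        "dropped_tokens": before - after,  # raw data is discarded
        "ratio": before / after,           # audit against a ~10:1 target
    }
```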
Isolating context through sub-agents gives each specialist a clean, focused window. We’ve seen splitting complex research across sub-agents — each with isolated context — deliver 90% improvement on multi-source synthesis tasks compared to single-agent approaches. The key rule: sub-agents return summaries of 1,000 to 2,000 tokens, never full results. Each focuses on one concern. The parent coordinates without duplicating work.
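The summaries-not-full-results rule is easy to enforce mechanically. In this sketch, `run_subagent` is a stand-in callable for a real agent invocation, and the token estimate is the same crude 4-chars-per-token heuristic; both are assumptions for illustration.

```python
# Illustrative isolation rule: a sub-agent does deep work in its own
# context and hands back only a bounded summary.

MAX_SUMMARY_TOKENS = 2_000  # ceiling on what a sub-agent may return

def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude ~4 chars/token estimate

def delegate(task: str, run_subagent) -> str:
    """Run a task in an isolated context; enforce the summary cap."""
    summary = run_subagent(task)  # full results stay in the sub-agent
    if rough_tokens(summary) > MAX_SUMMARY_TOKENS:
        raise ValueError(
            f"sub-agent returned ~{rough_tokens(summary)} tokens; "
            "summarize further before handing back"
        )
    return summary
```

Failing loudly here is deliberate: a sub-agent that returns too much is a design bug in the sub-agent, not something the parent should absorb.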
These four techniques compound. An agent that externalizes its working memory, selects only relevant inputs, compresses between steps, and delegates deep work to isolated sub-agents can operate dramatically below the 40% line while producing better results than an unoptimized agent filling its entire window.
Budgeting by Agent Role
Not every agent needs the same context allocation. We’ve settled on role-specific budgets that reflect how different agent types actually use their windows.
- Monitors (agents that fetch data on a schedule): below 20% context utilization, $0.50-1.00 per run. They start fresh every time, pull targeted data, and produce structured reports; there's no reason for them to accumulate anything.
- Analyzers (synthesize upstream data): below 30%, $1.00-2.00 per run, reading at most 5 previous reports.
- Generators (create artifacts like documents or code): below 40%, $2.00-5.00 per run. Creation genuinely requires more loaded context, so they get the most headroom.
- Orchestrators: below 25%, $2.00-3.00 per run. Their job is to coordinate, not accumulate.
- Reviewers (bounded inputs: a diff plus a set of rules): below 30%, $1.00-2.00 per run.
The budgets aren’t arbitrary. They emerged from watching agents fail when they exceeded these ranges and succeed when they stayed within them. When an agent consistently blows its budget, that’s a signal to redesign it — split it into sub-agents, tighten its inputs, or externalize its intermediate state.
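The role budgets can live as plain data with a post-run check. The figures come from this section; the table structure, field order, and function name are illustrative choices.

```python
# Role budgets as data, with a simple post-run enforcement check.

ROLE_BUDGETS = {
    # role: (max context utilization, min dollars, max dollars per run)
    "monitor":      (0.20, 0.50, 1.00),
    "analyzer":     (0.30, 1.00, 2.00),
    "generator":    (0.40, 2.00, 5.00),
    "orchestrator": (0.25, 2.00, 3.00),
    "reviewer":     (0.30, 1.00, 2.00),
}

def over_budget(role: str, utilization: float, cost_usd: float) -> list[str]:
    """Return budget violations for a run; an empty list means on budget."""
    max_util, _, max_cost = ROLE_BUDGETS[role]
    problems = []
    if utilization > max_util:
        problems.append(f"context {utilization:.0%} exceeds {max_util:.0%} cap")
    if cost_usd > max_cost:
        problems.append(f"cost ${cost_usd:.2f} exceeds ${max_cost:.2f} cap")
    return problems  # repeated violations signal a redesign, not a retry
```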
The Numbers — Externalizing a research agent’s findings: ~500 tokens instead of ~15,000. RAG on tool descriptions: 3x improvement in tool selection accuracy. Isolated sub-agents on multi-source synthesis: 90% improvement over single-agent approaches.
Handoffs: The Make-or-Break Moment
When context passes between agents, the quality of the handoff determines everything downstream. We’ve seen more multi-agent failures rooted in bad handoffs than in any other single cause.
A good handoff is minimal viable context: the task description, constraints, a brief context summary, and the expected output format. That’s roughly 150 tokens. A bad handoff dumps the full conversation history, all files read, and everything tangentially related — 18,000 tokens that poison the sub-agent’s context before it even starts working. The sub-agent now has to reason through a swamp of irrelevant information to find its actual task.
The rule is: pass conclusions, not raw data. Specify constraints clearly. Define the expected output format. Limit scope to a single concern. If you find yourself writing a handoff longer than 2,000 tokens, you’re probably asking the sub-agent to do too much or not summarizing the upstream work adequately.
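A handoff that enforces these rules by construction might look like the sketch below. The field names are illustrative, the token estimate is the same rough 4-chars-per-token heuristic, and the 2,000-token ceiling is the rule stated above.

```python
from dataclasses import dataclass

def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude ~4 chars/token estimate

@dataclass
class Handoff:
    """Minimal viable context for a sub-agent, nothing more."""
    task: str              # what to do
    constraints: str       # boundaries on how
    context_summary: str   # conclusions, never raw data
    output_format: str     # expected shape of the result

    def validate(self) -> None:
        """Reject bloated handoffs before they poison a sub-agent."""
        total = rough_tokens(
            self.task + self.constraints
            + self.context_summary + self.output_format
        )
        if total > 2_000:
            raise ValueError(
                f"handoff is ~{total} tokens; summarize upstream work "
                "or narrow the sub-agent's scope"
            )
```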
Spotting Trouble Early
Context problems have warning signs that show up before performance craters. When an agent reads 5+ files without producing output, when tool outputs accumulate without summarization, when conversation turns stretch past 10 on the same task, or when large search results get fully loaded into context — the agent is drifting toward the danger zone.
In the 30-40% range, the right response is to pause, summarize current state, and consider whether a sub-agent should take over the deep work. Past 40%, stop accumulating immediately. Summarize what you have, spawn a fresh sub-agent with only the summary, or trigger a manual checkpoint. The instinct to “just finish this one more step” before cleaning up context is almost always wrong — each additional step with a bloated window degrades the quality of every subsequent step.
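The warning signs above lend themselves to a mechanical drift check run between agent steps. The `AgentState` shape and the counter names are assumptions; the thresholds mirror the signals described in this section.

```python
from dataclasses import dataclass

@dataclass
class AgentState:
    """Illustrative per-agent counters for drift detection."""
    files_read_since_output: int = 0
    unsummarized_tool_outputs: int = 0
    turns_on_current_task: int = 0
    utilization: float = 0.0

def drift_warnings(state: AgentState) -> list[str]:
    """Surface warnings before performance craters."""
    warnings = []
    if state.files_read_since_output >= 5:
        warnings.append("5+ files read without producing output")
    if state.unsummarized_tool_outputs >= 3:
        warnings.append("tool outputs accumulating without summarization")
    if state.turns_on_current_task > 10:
        warnings.append("10+ turns on the same task")
    if state.utilization >= 0.40:
        warnings.append("past 40%: stop accumulating, summarize now")
    elif state.utilization >= 0.30:
        warnings.append("in 30-40% range: pause, consider a sub-agent")
    return warnings
```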
Five anti-patterns show up repeatedly across teams building multi-agent systems:

- Context hoarding: reading files “just in case.” The most common, and it feels responsible, but it’s actively harmful. Only read what you need right now.
- History dependency: relying on “what we discussed earlier” rather than stating it explicitly or writing it to a file. This creates fragile agents that break when context shifts.
- Output verbosity: including full file contents in responses instead of summaries with file references, which bloats handoffs unnecessarily.
- Tool output accumulation: running many tools without processing results between calls, filling the window with raw data that should have been summarized.
- Bloated tool sets: overlapping functionality forces the model to waste context reasoning about which tool to use. Keep tool sets minimal and unambiguous.
Key Takeaway — A good handoff is ~150 tokens of minimal viable context. A bad handoff is ~18,000 tokens that poison the sub-agent before it starts working. Pass conclusions, not raw data.
The Economics of Getting This Right
Context engineering isn’t just a quality play — it’s a cost play that compounds across every agent in your system.
At current rates of $3-15 per million tokens, an agent running at 80% context uses twice the tokens of one operating at 40%. In a multi-agent system, that multiplier applies across every agent, and inefficient handoffs compound costs further. A well-engineered system operating in the Smart Zone can cost 60-70% less than an equivalent unoptimized system while producing better results.
Track four metrics to stay honest: average context utilization (target under 40%), cost per outcome (should trend downward), handoff token size (under 2,000 tokens), and compression ratio on tool outputs (aim for 10:1). If these numbers are moving in the right direction, your context engineering is working. If they’re not, the techniques in this article tell you exactly where to look.
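The four metrics are straightforward to compute from per-run records. This sketch assumes each run is logged as a dict; the field names and the `health_report` function are illustrative, and the targets in the comments come from this section.

```python
from statistics import mean

def health_report(runs: list[dict]) -> dict:
    """Compute the four honesty metrics from per-run records.

    Each run dict is assumed to carry: utilization, cost, outcomes,
    handoff_tokens, tool_tokens_in, tool_tokens_kept.
    """
    total_outcomes = sum(r["outcomes"] for r in runs)
    return {
        # target: under 0.40
        "avg_utilization": mean(r["utilization"] for r in runs),
        # should trend downward over time
        "cost_per_outcome": sum(r["cost"] for r in runs) / total_outcomes,
        # target: under 2,000 tokens
        "max_handoff_tokens": max(r["handoff_tokens"] for r in runs),
        # aim for roughly 10:1
        "compression_ratio": (
            sum(r["tool_tokens_in"] for r in runs)
            / sum(r["tool_tokens_kept"] for r in runs)
        ),
    }
```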
The 40% Smart Zone isn’t just optimal for reasoning — it’s optimal for economics. Context engineering is the discipline of curating tokens, not maximizing them. Multi-agent systems make this critical, and they make the payoff substantial.
Sources: Anthropic Engineering (Effective Context Engineering for AI Agents, 2025), LangChain (Context Engineering for Agents), Chroma Research (Context Rot), internal analysis of 45 agents across 8 squads.