TL;DR: The biggest lever for AI agent performance isn’t the model or the prompt — it’s the context. Context engineering, the practice of curating exactly which tokens reach your model at runtime, is replacing prompt engineering as the critical discipline for building reliable agents. Teams that get this right see 4–32x cost reductions and dramatically better outcomes.
Prompt Engineering Had a Good Run
For two years, the developer world obsessed over prompt engineering. The right system message. The perfect few-shot example. The magic phrase that unlocks better reasoning.
That era is over.
In 2026, the teams shipping production AI agents — Anthropic, Manus, JetBrains, and dozens of startups — have converged on a different bottleneck: what information reaches the model at inference time. Not how you phrase the request. What the model can actually see.
Anthropic’s Applied AI team calls it context engineering: “the strategic curation of tokens available to language models during inference.” Manus, the autonomous agent platform, rebuilt their entire framework four times before landing on this insight. LangChain, Martin Fowler, and Weaviate have all published guides on it in the last few months.
This isn’t a rebrand. It’s a fundamentally different problem.
The Math That Killed Prompt Engineering
Here’s why context matters more than phrasing: transformer attention is quadratic in sequence length. A context of 100,000 tokens requires roughly 100x more attention computation than one of 10,000 — not 10x. And according to Anthropic’s research, models suffer from “context rot”: as token volume increases, accuracy in recalling any specific piece of information decreases.
More context doesn’t mean better performance. It often means worse.
This creates an engineering problem that no prompt can solve. You need to decide:
- What information the agent needs right now
- When to surface it (pre-loaded vs. just-in-time)
- How to structure it for efficient attention
- What to leave out — often the hardest decision
According to Manus, their agents operate at roughly a 100:1 input-to-output token ratio. For every token the agent generates, it consumes 100 tokens of context. That ratio makes context curation the single highest-leverage optimization available.
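A rough cost model makes the leverage concrete. The sketch below assumes illustrative Claude Sonnet-class rates of $3.00/MTok input and $15.00/MTok output; the point is the ratio, not the exact prices.

```python
# Rough cost model for one agent turn, assuming illustrative prices of
# $3.00/MTok input and $15.00/MTok output (Claude Sonnet-class rates).
def turn_cost(input_tokens: int, output_tokens: int,
              in_price: float = 3.00, out_price: float = 15.00) -> float:
    """Return the dollar cost of a single agent turn."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# At a 100:1 input-to-output ratio, a 500-token reply reads 50,000 tokens.
cost = turn_cost(input_tokens=50_000, output_tokens=500)
input_share = (50_000 * 3.00) / (50_000 * 3.00 + 500 * 15.00)

print(f"turn cost: ${cost:.4f}")                 # → $0.1575
print(f"input share of cost: {input_share:.0%}")  # → 95%
```

Even though output tokens cost 5x more per token here, the 100:1 volume ratio means the input side dominates the bill — which is why trimming context, not output, is the high-leverage move.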
The MCP Tax: A Case Study in Bad Context Engineering
The Model Context Protocol (MCP) is the clearest example of what happens when context engineering goes wrong.
MCP connects AI agents to external tools through structured schemas. The problem: those schemas are expensive. A single MCP tool definition costs 550–1,400 tokens. Connect three services — say GitHub, Slack, and Sentry — each exposing a dozen or more tools, and you can burn 55,000+ tokens before your agent reads a single user message.
A benchmark by Scalekit running 75 head-to-head tests on Claude Sonnet 4 found:
| Metric | CLI | MCP | Difference |
|---|---|---|---|
| Simplest task (check repo language) | 1,365 tokens | 44,026 tokens | 32x |
| Monthly cost at 10,000 operations | $3.20 | $55.20 | 17x |
| Context consumed before work begins | ~80 tokens | 55,000+ tokens | 687x |
One team reported MCP consuming 72% of their 200k context window just loading tool definitions — leaving only 57,000 tokens for actual conversation, reasoning, and output.
Perplexity has publicly moved away from MCP toward traditional APIs and CLI tools, citing context consumption and authentication issues as core problems.
This isn’t an argument against MCP. It’s an argument for intentional context engineering. The protocol itself is fine. Dumping 50 tool definitions into every agent call is not.
Five Principles That Actually Work
Based on what Anthropic, Manus, and the leading agent teams have published, here are the context engineering principles that matter in practice.
1. Start Minimal, Add on Failure
Anthropic’s guidance is blunt: “Test minimal prompts with your best available model first. Add instructions and examples only when failure modes emerge.”
The instinct to front-load context — detailed system prompts, comprehensive tool lists, extensive examples — is almost always wrong. Every unnecessary token competes for attention with the tokens that matter.
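One way to operationalize "add on failure" is to tie every prompt instruction to a failure mode actually observed in evals. The harness below is an illustrative sketch — the base prompt, failure-mode names, and patch instructions are all hypothetical.

```python
# Illustrative "start minimal, add on failure" harness: instructions are added
# only when a specific failure mode has been observed in evaluation.
# The prompt text and failure-mode names are hypothetical placeholders.
BASE_PROMPT = "You are a code-review agent. Review the diff and report issues."

# Map each observed failure mode to the one instruction that addresses it.
PATCHES = {
    "missed_security_issue": "Always check for injection and auth flaws.",
    "verbose_output": "Keep each finding to one sentence.",
}

def build_prompt(observed_failures: set[str]) -> str:
    """Compose the system prompt from the minimal base plus only the
    instructions justified by failures actually seen in evals."""
    lines = [BASE_PROMPT]
    lines += [PATCHES[f] for f in sorted(observed_failures) if f in PATCHES]
    return "\n".join(lines)

print(build_prompt(set()))               # minimal: base prompt only
print(build_prompt({"verbose_output"}))  # one targeted addition
```

The useful property is auditability: every line in the prompt can be traced back to a failure that justified it, so the prompt never accumulates speculative instructions.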
2. Design for Cache Hits
Manus discovered that with Claude Sonnet, cached tokens cost $0.30 per million versus $3.00 uncached — a 10x cost reduction. Their key insight: keep prompt prefixes stable. Even a single-token difference invalidates the cache.
This means append-only context construction, deterministic serialization, and explicit cache breakpoints. Architect your context the way you’d architect a database — with read patterns in mind.
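The two halves of that advice — deterministic serialization and append-only construction — can be sketched in a few lines. This assumes a provider that caches by exact prefix match (as Anthropic’s prompt caching does); the message shapes are simplified stand-ins, not a real SDK call.

```python
import json
import hashlib

SYSTEM = "You are a coding agent."        # stable prefix: never edited in place
TOOLS = [{"name": "grep"}, {"name": "read_file"}]

def serialize(obj) -> str:
    """Deterministic serialization: sorted keys and fixed separators, so the
    same logical content always yields byte-identical text (and cache hits)."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

def prefix_key(messages: list[dict]) -> str:
    """Hash of the serialized prefix — any change here means a cache miss."""
    return hashlib.sha256(serialize(messages).encode()).hexdigest()[:12]

# Append-only: new turns go at the end; earlier turns are never rewritten.
history = [{"role": "system", "content": SYSTEM},
           {"role": "user", "content": serialize(TOOLS)}]
k1 = prefix_key(history)

history.append({"role": "user", "content": "Find the failing test."})
k2 = prefix_key(history[:2])  # the original two-message prefix is untouched

assert k1 == k2               # stable prefix → cache hit on the next call
```

Anything that mutates earlier messages — rewriting a timestamp, reordering tool definitions, re-serializing with different key order — silently breaks this invariant and pushes you back to the 10x uncached price.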
3. Use the File System as Extended Memory
Both Anthropic and Manus converge on this: treat the filesystem as unlimited context. Agents that maintain structured notes (todo lists, progress summaries, architectural decisions) in files and pull them back into context on demand dramatically outperform agents that try to hold everything in the conversation window.
Anthropic’s research showed agents naturally developing maps, tracking objectives across thousands of steps, and maintaining strategic notes — all through file-based memory.
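At its simplest, file-backed memory is just notes on disk that re-enter context only when asked for. The sketch below uses invented file names and note structure — it is a pattern illustration, not any specific framework’s API.

```python
from pathlib import Path

# Minimal sketch of file-backed agent memory: notes persist on disk, and only
# the slice needed right now is pulled back into the context window.
MEMORY = Path("agent_memory")
MEMORY.mkdir(exist_ok=True)

def save_note(topic: str, text: str) -> None:
    """Append a note under a topic instead of holding it in context."""
    with open(MEMORY / f"{topic}.md", "a") as f:
        f.write(text + "\n")

def recall(topic: str) -> str:
    """Pull one topic's notes back into context on demand."""
    path = MEMORY / f"{topic}.md"
    return path.read_text() if path.exists() else ""

save_note("todo", "- [ ] migrate auth module")
save_note("decisions", "Chose SQLite over Postgres for the prototype.")

# Only the relevant topic re-enters the context, not the whole memory.
print(recall("todo"))
```

The point is the asymmetry: writing a note costs nothing in context, and reading costs only the tokens of the one topic retrieved — the rest of the memory stays off the attention budget entirely.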
4. Progressive Disclosure Over Pre-Loading
Instead of loading every tool definition upfront, surface capabilities on demand. A CLI `--help` call costs 50–200 tokens. A full MCP schema load can cost 55,000+.
The human analogy holds: we don’t memorize entire manuals. We build indexes and retrieve on demand. The best agent architectures do the same — pre-load only critical context, and use tools like grep, glob, and targeted queries for everything else.
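A progressive-disclosure tool registry can be this small. The tool names and argument descriptions below are invented for illustration; the shape — cheap index up front, full docs just-in-time — is the pattern.

```python
# Sketch of progressive disclosure: the agent sees a one-line index of tools
# up front and fetches a full description only when it decides to use one.
# Tool names and descriptions here are invented for illustration.
TOOL_DOCS = {
    "gh_create_issue": "Create a GitHub issue. Args: repo (str), title (str), "
                       "body (str), labels (list[str]). Returns the issue URL.",
    "slack_post": "Post a message. Args: channel (str), text (str).",
}

def tool_index() -> str:
    """Cheap upfront context: names only, a few tokens per tool."""
    return "Available tools: " + ", ".join(sorted(TOOL_DOCS))

def tool_help(name: str) -> str:
    """Expensive detail, loaded just-in-time — like a CLI --help call."""
    return TOOL_DOCS.get(name, f"Unknown tool: {name}")

print(tool_index())             # included in every prompt
print(tool_help("slack_post"))  # fetched only when the agent needs it
```

With dozens of tools, the index stays at a few hundred tokens while the full schema set — the part that costs tens of thousands — is paid for one tool at a time, only on use.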
5. Keep Errors Visible
Manus calls this one of their most counterintuitive findings: preserving failed actions in context improves agent performance. The instinct is to clean up errors and retry cleanly. But agents that can see what already failed make better decisions about what to try next.
“Error recovery is one of the clearest indicators of true agentic behavior,” according to the Manus team — yet it’s underrepresented in benchmarks that focus on ideal-condition success.
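Concretely, "keep errors visible" means the transcript is append-only across failures, and the agent can query what has already gone wrong. The record format and the example commands below are illustrative.

```python
# Sketch of error-preserving context: failed actions stay in the transcript
# instead of being scrubbed before retry. The record format is illustrative.
transcript: list[dict] = []

def record(action: str, ok: bool, detail: str) -> None:
    """Log every attempt — failures included — into the agent's context."""
    transcript.append({"action": action, "ok": ok, "detail": detail})

def already_failed(action: str) -> bool:
    """Let the agent check what it has already tried and seen fail."""
    return any(t["action"] == action and not t["ok"] for t in transcript)

record("pip install lxml", ok=False, detail="no compiler toolchain found")
record("pip install lxml --only-binary :all:", ok=True, detail="wheel installed")

# With failures visible, the agent can skip a known-bad retry:
assert already_failed("pip install lxml")
assert not already_failed("pip install lxml --only-binary :all:")
```

The cleanup-and-retry instinct erases exactly the signal (`detail` above) that would have steered the next attempt somewhere new.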
What This Means for Developer Tools
Context engineering isn’t just an LLM optimization technique. It’s reshaping how developer tools are built.
JetBrains Central, announced March 24, is built entirely around this idea. It maintains a “semantic layer” that aggregates context from code, architecture, runtime behavior, and organizational knowledge — feeding agents system-level understanding rather than raw file dumps.
Claude Code implements context engineering through sub-agent architectures: specialized agents explore tens of thousands of tokens but return only 1,000–2,000 token summaries. The lead agent synthesizes results with clean context.
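The budget-enforcement half of that pattern can be sketched simply. Here `subagent_report` just truncates to a token budget using a rough 4-characters-per-token heuristic — a real system would summarize with a model call rather than truncate, and the budget number is illustrative.

```python
# Sketch of the sub-agent pattern: a worker explores a large corpus but hands
# the lead agent only a budget-capped report. A real system would summarize
# via a model call; truncation here just demonstrates the budget enforcement.
MAX_SUMMARY_TOKENS = 1_500

def rough_tokens(text: str) -> int:
    """Crude token estimate (~4 characters per token)."""
    return len(text) // 4

def subagent_report(findings: str) -> str:
    """Cap the worker's findings to a budget before they re-enter the
    lead agent's context."""
    budget_chars = MAX_SUMMARY_TOKENS * 4
    if len(findings) <= budget_chars:
        return findings
    return findings[:budget_chars] + "\n[truncated to fit summary budget]"

raw = "file_1: no issues\n" * 5_000   # tens of thousands of tokens explored
report = subagent_report(raw)

assert rough_tokens(report) <= MAX_SUMMARY_TOKENS + 10
```

The lead agent’s context grows by at most the budget per sub-agent, no matter how many tokens each worker burned exploring — exploration cost and synthesis cost are decoupled.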
Browser automation tools face this tradeoff directly. Feeding an agent a full page DOM is context-expensive and noisy. The alternative: targeted screenshots combined with accessibility trees that surface only actionable elements. This is the approach used by Playwright MCP’s accessibility-first interaction model and tools like screencli that feed agents compressed visual context (screenshots + ARIA snapshots) rather than raw HTML.
The pattern is consistent: the best tools don’t give agents more information. They give agents the right information.
The 1M Context Trap
Claude Opus 4.6 and Sonnet 4.6 shipped 1M context windows at GA on March 13, 2026 — at standard pricing, no surcharge. It’s tempting to treat this as a solution: just throw everything in.
It isn’t. Larger context windows don’t eliminate context rot. They raise the ceiling but don’t change the physics. An agent with 1M tokens of poorly curated context will underperform an agent with 50K tokens of well-curated context.
The teams winning in 2026 use large context windows as a safety net, not a strategy. The strategy is still curation.
The Bottom Line
If you’re building AI agents and spending your time on prompt wording, you’re optimizing the wrong variable. The research from Anthropic, Manus, LangChain, and the broader agent ecosystem points to the same conclusion:
The smallest set of high-signal tokens wins.
Every context slot is a scarce resource. Every unnecessary token is attention stolen from a useful one. The discipline of choosing what your agent sees — and what it doesn’t — is the defining skill of agent engineering in 2026.
Prompt engineering asked: “How do I phrase this?” Context engineering asks: “What does the model need to see right now, and nothing more?”
The second question is harder. It’s also the one that matters.
FAQ
What’s the difference between prompt engineering and context engineering? Prompt engineering focuses on crafting effective instructions and phrasing. Context engineering manages the entire information state available to the model at runtime — including system prompts, tools, retrieved data, conversation history, and external context. It’s a superset that treats token curation as the primary optimization lever.
Does a larger context window solve context engineering problems? No. According to Anthropic’s research, models suffer from “context rot” where accuracy decreases as token volume increases, regardless of window size. Larger windows provide more room but don’t eliminate the need for curation. Well-curated 50K-token contexts regularly outperform poorly curated 500K-token ones.
How much do MCP tool definitions actually cost in tokens? A single MCP tool definition costs 550–1,400 tokens. Connecting 40 tools burns 55,000+ tokens before the agent processes any user input. Scalekit’s benchmark found MCP costs 4–32x more tokens than CLI alternatives for identical operations, with monthly costs 17x higher at scale.
What’s the most impactful context engineering technique to start with? Start minimal. Anthropic’s recommendation: test with the smallest possible context on your best model, then add information only when you observe specific failure modes. Most teams over-provision context by default, which actively hurts performance.
How does context engineering affect AI agent costs? Dramatically. Manus reports that cache-optimized context reduces per-token costs by 10x ($0.30/MTok vs $3.00/MTok). CLI-based tool access costs $3.20/month at 10K operations versus $55.20 for equivalent MCP calls. Context engineering is simultaneously a performance and cost optimization.