Most Agent Failures Aren't Model Failures — They're Context Failures

The most counterintuitive operational fact about AI agents in 2026 is this: giving a model more information often makes it worse. Chroma Research demonstrated it directly. In its Context Rot study, the team evaluated 18 frontier models — including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 — and found that model reliability decreases significantly as input length grows, “often in surprising and non-uniform ways,” even on tasks as simple as retrieving or replicating a piece of text. The intelligence is there. The ability to use it reliably erodes as the context window fills.

The Pattern

For two years the industry treated the prompt as the unit of work. The reframing now underway is that the prompt is just one slice of a much larger object — the entire information environment an agent operates inside. Anthropic’s engineering team calls this context engineering: the discipline of curating what goes into the context window, what gets removed, when to compress an accumulating message history, which documents to retrieve just-in-time, and how to route sub-tasks to isolated sub-agents so each one works against a clean, minimal context rather than a bloated shared one.
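As a rough illustration of two of those levers, just-in-time retrieval and a minimal per-sub-agent context, the sketch below selects only the documents relevant to a sub-task and stops at a token budget. Everything in it is an assumption made for illustration: the names `Document`, `relevance`, and `build_subagent_context` are hypothetical, the word-overlap score stands in for a real retriever, and the word count stands in for a real tokenizer.

```python
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


@dataclass(frozen=True)
class Document:
    name: str
    text: str


def relevance(task: str, doc: Document) -> int:
    # Hypothetical score: number of distinct task words that also appear in the document.
    task_words = set(task.lower().split())
    doc_words = set(doc.text.lower().split())
    return len(task_words & doc_words)


def build_subagent_context(task: str, docs: list[Document], budget: int) -> list[Document]:
    """Return the most relevant documents that fit inside the token budget."""
    ranked = sorted(docs, key=lambda d: relevance(task, d), reverse=True)
    selected: list[Document] = []
    used = 0
    for doc in ranked:
        cost = estimate_tokens(doc.text)
        # Skip documents with no overlap, and any document that would exceed the budget.
        if relevance(task, doc) == 0 or used + cost > budget:
            continue
        selected.append(doc)
        used += cost
    return selected


docs = [
    Document("billing", "invoice totals are computed in the billing module"),
    Document("auth", "login tokens expire after one hour"),
    Document("style", "use four spaces for indentation"),
]
context = build_subagent_context("fix invoice totals in billing", docs, budget=20)
print([d.name for d in context])  # ['billing']
```

The sub-agent working on the billing fix never sees the authentication or style notes, which is the point: its context is bounded by the task rather than by everything the project has accumulated.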

This matters most precisely where agents are supposed to earn their keep: long-horizon work. As practitioners building long-horizon agents have documented, tasks that run for tens of minutes to several hours — a large codebase migration, a multi-stage research project — routinely generate more tokens than any context window holds. Without deliberate context management, the agent’s accumulated history becomes the very thing that degrades its judgment. The failure looks like the model “getting confused” late in a task. The cause is that no one engineered what the model was allowed to remember.

The operational takeaway from this body of work is blunt: most agent failures in production today are not model failures but context failures. Swapping in a smarter model does not fix a context that has rotted; it just rots more expensively.

Why It Matters

For anyone running agents in production, this redraws where engineering effort should go. The instinct — “the agent is unreliable, we need a better model” — is usually wrong and always expensive. The higher-leverage move is to treat context as a first-class system with its own architecture: a deliberate policy for what each agent sees, an explicit compaction step that summarises and discards stale history before it poisons the next turn, and sub-agent isolation so a narrow task is not reasoning across the entire accumulated state of a project.
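An explicit compaction step could look something like the following minimal sketch. It assumes a message history held as a list of strings, a word count as a stand-in for a tokenizer, and a placeholder summariser; `compact`, `placeholder_summarise`, `budget`, and `keep_recent` are hypothetical names, and a real system would most likely ask a model to write the summary.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def placeholder_summarise(messages: list[str]) -> str:
    # Stand-in only; a production system would likely generate a real summary here.
    return f"[summary of {len(messages)} earlier messages]"


def compact(
    history: list[str],
    budget: int,
    keep_recent: int,
    summarise: Callable[[list[str]], str] = placeholder_summarise,
) -> list[str]:
    """Replace stale history with a summary once the history exceeds the budget."""
    total = sum(estimate_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return list(history)  # nothing to compact
    split = len(history) - keep_recent
    stale, recent = history[:split], history[split:]
    return [summarise(stale)] + recent


history = [
    "read file a.py",
    "found three call sites",
    "patched call site one",
    "tests pass for site one",
    "now patching call site two",
]
compacted = compact(history, budget=12, keep_recent=2)
print(compacted)
```

With a budget of 12 and the two most recent messages kept, the three oldest messages collapse into a single summary line. Note that this sketch does not guarantee the result fits the budget; it only ensures that stale turns stop being carried forward verbatim.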

For builders evaluating vendors, “what model do you use?” is now a weak diligence question. The sharper one is “how do you manage context over a long task?” A team that cannot answer that crisply is shipping agents that will degrade in exactly the multi-step workflows enterprises pay the most for.

The Charaka View

We run this discipline on ourselves, not as theory. Manthan’s operating memory is deliberately tiered: a lean core stays in context, while heavier reference material is loaded on demand and compacted on a schedule when it grows past a threshold. The reason is precisely what the Chroma data predicts: an unbounded, ever-growing context does not make an agent wiser; it makes it less reliable. Our multi-lens assessment architecture compounds the same benefit, because routing a company through independent lenses gives each one a clean, bounded context instead of one analyst reasoning across an overloaded window. The lesson the field is converging on is one we treat as an operational invariant: the scarce resource in an agentic system is not intelligence but attention, and attention has to be engineered.
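To make the tiering idea concrete, here is a generic sketch of a two-tier memory. It is not a description of Manthan’s actual implementation: the `TieredMemory` class, its fields, and the example data are all hypothetical, and for brevity it evicts the oldest loaded material when a threshold is crossed instead of compacting it on a schedule.

```python
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


@dataclass
class TieredMemory:
    core: list[str]              # lean tier: always in context
    reference: dict[str, str]    # heavy tier: stays out of context until requested
    loaded_limit: int            # token threshold for loaded reference material
    loaded: dict[str, str] = field(default_factory=dict)

    def load(self, key: str) -> None:
        """Bring one reference entry into context, then enforce the threshold."""
        self.loaded[key] = self.reference[key]
        self._enforce_limit()

    def _enforce_limit(self) -> None:
        # Evict the oldest loaded entries (dicts keep insertion order) until the
        # loaded tier is under the threshold, always keeping the newest entry.
        while (
            len(self.loaded) > 1
            and sum(estimate_tokens(t) for t in self.loaded.values()) > self.loaded_limit
        ):
            oldest = next(iter(self.loaded))
            del self.loaded[oldest]

    def context(self) -> list[str]:
        """What the agent actually sees: the core plus currently loaded reference."""
        return self.core + list(self.loaded.values())


memory = TieredMemory(
    core=["You assess companies."],
    reference={
        "pricing": "pricing tiers and discount rules",
        "team": "founder backgrounds and hiring plan details",
    },
    loaded_limit=8,
)
memory.load("pricing")
memory.load("team")
print(memory.context())
```

Loading the second entry pushes the loaded tier past the threshold, so the first entry is dropped and the context stays bounded no matter how large the reference store grows.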

This analysis draws on Chroma Research’s Context Rot study (Jul 2025), Anthropic’s guide to effective context engineering for AI agents, and EPAM’s field notes on long-horizon agents in production. Human editorial oversight applied.

This analysis is informational and does not constitute investment advice, a research report, or a recommendation to buy, sell, or hold any security.

Charaka Notes by Manthan Intelligence.
