Context Is the Instruction: Why Context Engineering Requires a Layer the Model Cannot Provide

Context engineering can't optimize a working-memory buffer into a knowledge store. The evidence for why enterprise AI needs a separate context layer.

A technical position from Nucleus AI. The field has correctly moved from prompts to context. We argue the move is incomplete: the context window is a working-memory buffer, and a growing body of evidence shows it cannot be optimized into the persistent, verified, organization-scoped substrate that reliable enterprise AI requires. That substrate is a separate architectural layer. This is the argument, the evidence, and a controlled observation of the layer in operation.

Raakin Iqbal

Summary

The reliability of an AI system operating over organizational knowledge is determined less by the model than by the quality of context delivered to it. The industry has begun to recognize this, reframing the discipline from "prompt engineering" to "context engineering." We regard that reframing as correct and important. We also regard it as incomplete in a specific, falsifiable way.

Context engineering optimizes the contents of the context window. The context window is a bounded, per-inference working-memory buffer. A convergent body of evidence — vendor-reported benchmarks, peer-reviewed degradation studies, compaction-stability research, and measured inference economics — indicates that no amount of in-window optimization produces persistence, verification, reconciliation, or provenance, because those are not properties the buffer can hold. They are properties of a distinct layer that must sit beneath the model. We call it the context layer (the understanding layer). This piece states the distinction precisely, explains at the mechanism level why length itself degrades performance, grounds the argument in the literature, specifies the four-operation retrieval contract that distinguishes a context layer from retrieval over a vector store, reports a controlled observation of the layer in operation, and offers a falsifiable prediction about how the constraint behaves as models improve.

Two objects the field keeps conflating

We begin with a definition, because most of the disagreement in this area dissolves once two distinct objects are separated.

A context window is the token-bounded buffer a model attends to during a single inference run. It is a property of the model and the run. It scales with the architecture and resets with every run; its function is to hold the working memory of one thought. A context layer is the persistent, organization-scoped substrate that holds verified knowledge, source provenance, trust weights, and reconciled facts across every agent, human, and integration operating inside an institution. It is a property of the organization, not the model; it does not reset; its function is to deliver correct, grounded inputs into whichever context window is active.

These objects differ in scope, lifetime, consumer, and failure mode. The most common architectural error we observe in enterprise deployments is the substitution of the first for the second — treating "more context window" as a path to the guarantees only a context layer can provide. A simple diagnostic separates them: if a system's context window were doubled tomorrow, would it become more reliable on tasks that depend on reconciling conflicting internal sources? Where the answer is no — which, in our experience, is nearly everywhere in the enterprise — the binding constraint is not the window.

Where the field has converged, and where we extend it

The strongest articulation of the shift comes from Andrej Karpathy, and we agree with essentially all of it. The argument for "context engineering" over "prompt engineering" is that, in any industrial-strength LLM application, the substantive work is assembling the right information for the next step (task framing, exemplars, retrieval, tools, state, history, and compaction), and that this assembly is non-trivial engineering, not a casual instruction. Karpathy describes context engineering as one small piece of an emerging thick layer of non-trivial software that coordinates individual model calls into full applications. We consider that framing correct and the dismissive "wrapper" characterization wrong.

Two elements of his framing are load-bearing for our argument. The first is the mental model: the model is analogous to a CPU and the context window to RAM. The second is the recognition that the real engineering lives in a thick software layer around the model rather than inside it.

The CPU/RAM analogy is, in our view, exactly right — and it contains the entire constraint. RAM is working memory. It is not storage. No production system runs its system of record out of RAM, nor expects RAM to persist, verify, or reconcile state across processes. The implicit proposition in much of the current trajectory — that better in-window assembly plus larger windows yields reliable organizational AI — asks the working-memory buffer to perform the role of a persistent storage-and-verification layer it was never architected to be. Our extension is a single sentence appended to the prevailing view: context engineering is necessary and is not sufficient, because the buffer it optimizes cannot, by construction, become the layer the system actually requires.

The architectural ceiling, in the evidence

The claim that in-window optimization faces a structural ceiling is not rhetorical. Three independent lines of evidence converge on it.

Long-context performance degrades, including on vendor-reported benchmarks. Google's own published model card for Gemini 3.1 Pro reports 84.9% on MRCR v2 8-needle retrieval at the 128K band and 26.3% at 1M, a decline of 58.6 points within the same vendor-reported table. Both figures are self-reported by the developer, which makes the degradation a vendor concession rather than an adversarial finding. It is also representative of a broad independent literature: NoLiMa (Modarressi et al., Adobe Research, arXiv:2502.05167), in which the majority of models claiming 128K support fall below half their short-context baseline by 32K under non-literal matching; and the "lost in the middle" finding of Liu et al. (TACL 2024), in which information positioned mid-window becomes substantially less recoverable than information at the extremes, a U-shaped curve that larger windows do not repair.

The degradation is attributable to length itself, not to retrieval quality. Du et al. (EMNLP 2025, "Context length alone hurts LLM performance despite perfect retrieval," arXiv:2510.05381) report accuracy declining by roughly 13.9% to 85% as context grows under conditions of perfect retrieval, that is, with the correct information demonstrably present in the window. This isolates the variable: the failure persists when retrieval is held perfect, which means it cannot be remediated by improving retrieval. It is a property of operating the buffer at length.

Compaction, the standard remedy when the window saturates, is lossy and non-reproducible. Summarization-based compaction discards information unpredictably and irreversibly. Research on parallel context compaction for long-horizon agents (Penn State, arXiv, 2026) reports run-to-run instability rising with context, with the coefficient of variation in summary output reaching 171.6% at 96K tokens and run-to-run semantic overlap falling below half — meaning two identical runs of the same compaction produce summaries sharing less than 50% of their content. The reproducibility consequence is the salient one: an agent given an identical task compacts differently across runs and produces divergent outputs. A related result on governance decay (arXiv, 2026) makes the failure concrete and safety-relevant: across more than 1,300 episodes, the rate at which agents violated an in-context policy constraint rose from 0% while the constraint remained in full context to roughly 30% after compaction, reaching as high as 59% for some models — the constraint silently dropped from the summary, and the agent proceeded to take the prohibited action. In regulated settings, non-reproducibility of this kind is not a performance characteristic; it is an audit and governance failure by construction.
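For readers unfamiliar with the two instability measures named above, the following is a minimal sketch of how such quantities can be computed. It is not the cited paper's method: the function names are hypothetical, the overlap here is a crude word-set (Jaccard) stand-in for a semantic measure, and the choice of population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values: list[float]) -> float:
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


def lexical_overlap(summary_a: str, summary_b: str) -> float:
    """Jaccard overlap of word sets: a lexical stand-in for semantic overlap."""
    words_a = set(summary_a.lower().split())
    words_b = set(summary_b.lower().split())
    union = words_a | words_b
    # Two empty summaries are treated as identical.
    return len(words_a & words_b) / len(union) if union else 1.0
```

A coefficient of variation above 100% simply means the spread across runs is larger than the average itself, which happens when a few runs differ wildly from the rest. An overlap below 0.5 means two runs of the same compaction share less than half of their content under the chosen measure.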

Frontier-lab engineering writing concedes the underlying shape of this, describing context explicitly as a finite resource subject to diminishing marginal returns and a finite "attention budget" depleted by every token. The economics make the ceiling concrete. The Stanford Digital Economy Lab's published study of agentic token consumption (2026) establishes the independently verifiable figures: agentic workloads consume on the order of a thousand times the tokens of a comparable chat interaction, runs on an identical task vary by up to 30x in total tokens with no corresponding accuracy gain, and frontier models cannot predict their own token usage (self-correlation no higher than ~0.39), systematically underestimating real cost. A widely cited corollary holds that re-sent context accounts for roughly 62% of agent inference spend — the model paid repeatedly to re-read material it has already processed; we cite it as directionally consistent rather than settled, because the specific figure reaches us through a secondary source and is not yet confirmed in the primary publication. Composed with the Du et al. result, the failure is precise and double-charged: an enterprise pays for the tokens that degrade the output, then pays again for the retry. Independent forecasting (Gartner) anticipates that more than 40% of agentic AI projects will be cancelled by the end of 2027 on cost and value grounds. The constraint is not price-per-token, which has fallen sharply; it is the attempt to make one architectural component discharge a function it cannot.
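The re-sent-context mechanism can be illustrated with simple arithmetic. The sketch below assumes a hypothetical agent loop that appends a fixed number of new tokens per step and re-sends its full history on every call; the function name and the step and token values are illustrative assumptions, not figures from the cited studies.

```python
def resent_share(steps: int, new_tokens_per_step: int) -> float:
    """Fraction of total input tokens that are re-sent history.

    Assumes every step appends the same number of new tokens and that
    the full history is re-sent on every call (no caching, no compaction).
    """
    total_input = 0
    fresh_input = 0
    history = 0
    for _ in range(steps):
        # Each call pays for the whole history plus this step's new tokens.
        total_input += history + new_tokens_per_step
        fresh_input += new_tokens_per_step
        history += new_tokens_per_step
    return (total_input - fresh_input) / total_input
```

Under these assumptions the share works out to (n - 1) / (n + 1) for n steps, independent of the tokens per step: half of all input tokens are re-reads after three steps, and four fifths after nine. The point of the sketch is only that the re-read share grows with the length of the loop, whatever the price per token.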

We hold the steelman in view, and it has real substance. Long-context capability is improving quickly: independent trend data (Epoch AI; Stanford HAI AI Index 2026) finds the input length at which top models sustain 80% accuracy rising by more than two orders of magnitude inside a year, and at the 128K band the strongest current models cluster around 84–94% on MRCR. We take this seriously and expect it to continue. It does not relax the argument, for three reasons the same literature supplies. First, the strong retrieval scores are single-needle; production enterprise retrieval is multi-needle and multi-document, where the field's scores fall by a reported 15–40 points and the best 1M multi-needle results sit well below their single-needle equivalents. Second, the Du et al. degradation occurs under perfect retrieval, so it is untouched by retrieval improvement and unhelped by a larger window. Third, and most directly, external structured-memory systems already outperform full-context reading on the benchmarks built for the comparison while consuming a small fraction of the tokens — a result now corroborated by peer-reviewed, non-vendor work (RecMem, ACL Findings 2026), not only by commercial memory vendors. This is the expected outcome if the constraint is architectural rather than capacity-bound: improving the window raises the ceiling that both approaches climb toward; it does not make the window into the layer.

Why length itself degrades performance

It is worth being explicit about the mechanism, because the failure is often misattributed to retrieval and therefore mis-remediated. The degradation is a property of how attention behaves as the token count grows, and it has at least three compounding sources that no amount of in-window curation removes.

The first is dilution of the attention distribution. Self-attention assigns a normalized weight across all tokens in the window; as the sequence lengthens, the same probability mass is divided across more positions, and the marginal salience of any single relevant token falls. This is the formal shape behind the "lost in the middle" result of Liu et al. (TACL 2024): positional bias toward the beginning and end of the window is not a quirk of a particular model but a consequence of how training data and positional encodings interact, and it produces the characteristic U-shaped recall curve in which mid-context facts are recovered worst. Enlarging the window moves the middle further from both anchors; it does not remove the middle.
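The dilution effect can be shown with a toy softmax. This is a sketch under a deliberately simple assumption: one relevant token scores a fixed margin (here called `logit_gap`, a hypothetical name) above every other token, and all other tokens score exactly zero. Real attention is multi-head and learned, so the numbers are illustrative only.

```python
import math


def relevant_weight(n_tokens: int, logit_gap: float) -> float:
    """Softmax weight on one relevant token among n_tokens.

    Toy assumption: the relevant token scores `logit_gap` above every
    other token, and all other tokens score exactly 0.
    """
    scores = [logit_gap] + [0.0] * (n_tokens - 1)
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    return exps[0] / sum(exps)


# Same relevant token, same score margin, longer window each time.
for n in (1_000, 32_000, 128_000, 1_000_000):
    print(f"{n:>9} tokens -> weight {relevant_weight(n, logit_gap=8.0):.4f}")
```

In closed form the weight is e^gap / (e^gap + n - 1), so with the margin held fixed it falls toward zero as n grows. Keeping the relevant token salient at length requires the score margin to grow with the logarithm of the window size, which is the formal sense in which a longer window dilutes attention.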

The second is positional encoding extrapolation. Models trained at one context length and operated beyond it must extrapolate their positional scheme into a regime the weights never saw, and the effective context length — the length at which the model still uses information reliably — is consistently and substantially shorter than the advertised maximum. The independent literature on effective context length places it well below claimed length for most open models; the practical consequence is that "1M context" is a capacity specification, not a reliability guarantee, and the two should never be read as the same number.

The third is the distinction the Du et al. result isolates so cleanly: even holding retrieval perfect and removing all distractor content, raw length degrades reasoning. Their tests replaced irrelevant tokens with non-distracting whitespace and still observed substantial loss — on the order of a 24-point drop on MMLU for one model at 30K tokens, and a 30-point drop on GSM8K for another. This is the finding that should reorganize how the field thinks about the problem: if the harm survives perfect retrieval and zero distraction, the harm is the length, and the remedy cannot live inside the window. It has to be a mechanism that keeps what the model attends to small, structured, and verified — which is a description of a context layer, not a context window.

The compaction failure compounds all three. Once the window saturates and summarization begins, the system is not only operating in the degraded regime above; it is now non-deterministic, because the summarizer's output varies run to run. The combination — degradation from length, plus instability from the remedy for length — is why long-horizon agents drift: each compaction event both loses information and loses it differently, so the agent's working state diverges from its own history in a way that cannot be reconstructed or audited.

A layering problem, not a tuning problem

The structure of the constraint suggests its resolution. Biological cognition does not implement memory, reasoning, and conflict resolution by tuning a single general region harder; it distributes them across differentiated structures, with distinct mechanisms for working memory, consolidation, retrieval, and reconciliation. Capability emerges from the layering.

The prevailing engineering trajectory inverts this: it asks one structure — the transformer and its context window — to perform working memory, long-term storage, retrieval, verification, and reconciliation simultaneously, and seeks to improve the result by optimizing what enters the buffer or by enlarging it. The evidence above indicates this does not converge. The implication is not a more optimized token-packing strategy, nor a transformer variant that internalizes storage and retrieval, but the extraction of those functions into a dedicated layer architected for persistence, verification, and reconciliation, positioned beneath the layer architected for language.

We are precise about scope. The thick layer Karpathy describes has many components — control flow, model dispatch, tool integration, verification, evaluation, guardrails. We do not claim to be all of it. We claim that the substrate underneath all of it — persistent, verified, organization-scoped context — is the component the field has not yet built, and the one to which the failure modes above reduce.

A controlled observation

To test whether a context layer changes outputs in the manner the argument predicts, we run a controlled comparison: identical model, identical settings, identical instruction, evaluated with the context layer present and absent. The following is one such run, reported because it is representative and because the accompanying recording makes it independently inspectable.

We issued a single instruction of twenty-five words — approximately six of which were the organization's name — requesting a one-page investment memorandum and directing the system to draw on what it knew. The instruction contained no role assignment, no formatting specification, and no section schema. By the conventions of context engineering, it is underspecified.

With the context layer present, the system did not retrieve a set of documents and delegate composition to the model — an architecture that would reduce to retrieval in front of a model and would inherit the failure modes above. The layer supplied the structure of the artifact, the relationships among its components, and the organizational conventions governing it: it specified how the artifact should be constructed, not merely what to include. The output accordingly exceeded the generic memorandum scaffold and incorporated material not named in the instruction — the organization's narrative and positioning, the structure of its financing round, and its visual conventions — because the layer encoded the relationship that a memorandum omitting positioning is incomplete.

A second test isolated retrieval behavior. A plain-language query — "what is the company narrative" — returned the precise positioning document, although the terms "narrative" and "positioning" did not appear in that document's text at creation. The match was produced by modeled relationships and meaning rather than lexical overlap.

The retrieval contract: four operations a window cannot perform

The phrase "context layer" is easy to wave at and hard to pin down, so we will be concrete about what the layer returns and how it differs, operationally, from retrieval-augmented generation over a vector store. The distinction is not "we also do retrieval." It is that the unit of delivery is not a ranked list of text chunks but a structured object carrying four guarantees the model can act on. We call these the retrieval contract.

Trust weighting. Every returned item is tagged with a tier — canonical, active, or unverified — and that tier travels with the content into the model's context rather than being discarded at retrieval time. The failure we see most in naive RAG is not that the wrong document was retrieved; it is that a draft, a superseded version, and the authoritative record were all retrieved as undifferentiated text, leaving the model to average across them. A window has no representation for "this sentence is canonical and that one is a stale draft." The layer does.
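One way the tier could travel with the content is sketched below. The type names, the numeric ordering, and the bracketed label format are all hypothetical choices made for illustration; the source does not specify how the layer represents or serializes tiers.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    # Higher value means more authoritative (an assumed ordering).
    UNVERIFIED = 0
    ACTIVE = 1
    CANONICAL = 2


@dataclass(frozen=True)
class ContextItem:
    text: str
    tier: Tier


def render_for_model(items: list[ContextItem]) -> str:
    """Serialize items so the tier label stays attached to each passage."""
    ordered = sorted(items, key=lambda item: item.tier, reverse=True)
    return "\n".join(f"[{item.tier.name}] {item.text}" for item in ordered)
```

The design point is that the label is assigned before the content reaches the model and is never stripped on the way in, so a stale draft arrives marked as such instead of as undifferentiated text.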

Reconciliation. When two sources conflict — and in any real organization they constantly do — the layer does not blend them into a plausible median. It surfaces the conflict and resolves precedence by explicit policy: tier first, then recency, then source-attached authority instructions. The model receives a reconciled answer with the conflict noted, not a smoothed one with it hidden. This is the most safety-critical property, because the most dangerous enterprise error is not a visible gap but a confident answer that quietly split the difference between a current policy and a rescinded one.
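The precedence policy described above (tier, then recency, then source-attached authority) can be sketched as a sort key. Everything named here is an assumption for illustration: the `Claim` and `Reconciled` types, the integer `authority` score standing in for source-attached authority instructions, and the rule that a conflict is any losing claim whose value differs from the winner's.

```python
from dataclasses import dataclass
from datetime import date

TIER_RANK = {"canonical": 2, "active": 1, "unverified": 0}


@dataclass(frozen=True)
class Claim:
    source: str
    value: str
    tier: str            # "canonical", "active", or "unverified"
    updated: date
    authority: int = 0   # hypothetical stand-in for source-attached authority


@dataclass(frozen=True)
class Reconciled:
    value: str
    winner: str
    conflict: bool
    overridden: tuple[str, ...]


def reconcile(claims: list[Claim]) -> Reconciled:
    """Pick a winner by tier, then recency, then authority; keep the conflict visible."""
    if not claims:
        raise ValueError("nothing to reconcile")
    ranked = sorted(
        claims,
        key=lambda c: (TIER_RANK[c.tier], c.updated, c.authority),
        reverse=True,
    )
    winner = ranked[0]
    # Sources that disagreed with the winner are reported, not dropped.
    overridden = tuple(c.source for c in ranked[1:] if c.value != winner.value)
    return Reconciled(winner.value, winner.source, bool(overridden), overridden)
```

Note that a newer unverified draft does not beat an older canonical record, because tier is compared before recency. The `conflict` flag and `overridden` list are what let the model receive a reconciled answer with the disagreement noted.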

Spatial expansion of the query. A production query is not a keyword; it is an intent implicating named entities, unnamed-but-adjacent entities, and the workflow it sits inside. Returning only the documents that match the query string misses exactly the connected context the user did not know to ask for — the mechanism that, in the controlled observation, surfaced the positioning document the instruction never named. The layer returns an interconnected subgraph, not a ranked list. Search optimizes for relevance to a string; expansion optimizes for completeness of an intent.
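As a rough illustration of returning a subgraph instead of a ranked list, the sketch below expands a set of matched entities by a bounded number of hops over an adjacency map. The source does not say how the layer stores or traverses relationships; hop-bounded breadth-first search, the `expand` function, and the example node names are stand-ins chosen for clarity.

```python
from collections import deque


def expand(graph: dict[str, set[str]], seeds: set[str], max_hops: int) -> set[str]:
    """Return every node within max_hops of any seed (breadth-first)."""
    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand past the hop budget
        for neighbor in graph.get(node, set()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return seen


# Hypothetical relationship graph for a memo request.
graph = {
    "investment_memo": {"financing_round", "positioning"},
    "positioning": {"brand_guidelines"},
    "financing_round": set(),
}
connected = expand(graph, {"investment_memo"}, max_hops=1)
```

Here a query that matches only `investment_memo` still pulls in `positioning` and `financing_round`, because they are modeled as adjacent, even though neither name appears in the query string.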

Provenance to ground truth. Every supplied claim traces to a source, with a confidence the downstream system can branch on. This is what makes the output auditable rather than merely fluent, and it converts trust from an assertion into an artifact: a system that cannot say where a claim came from cannot be trusted with a decision that has to be defended later.
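A confidence the downstream system can branch on might look like the following sketch. The `GroundedClaim` type, the three routing outcomes, and the 0.8 threshold are hypothetical; the source states only that each claim traces to a source and carries a confidence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GroundedClaim:
    statement: str
    source_id: Optional[str]  # None means no traceable source
    confidence: float         # assumed to lie in [0.0, 1.0]


def route(claim: GroundedClaim, threshold: float = 0.8) -> str:
    """Decide what a downstream system does with a claim."""
    if claim.source_id is None:
        return "reject"   # cannot be defended later, so never used
    if claim.confidence < threshold:
        return "review"   # traceable but weak: send to a human
    return "accept"
```

The ordering of the checks is the design choice: an untraceable claim is rejected regardless of how confident it looks, which is the difference between an auditable output and a merely fluent one.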

None of the four is expressible inside a context window, however large or well-packed, because all four are properties of the relationships between pieces of knowledge and their status over time — and the window is a flat, equal-weight, point-in-time buffer with no representation for status, authority, or relationship. This is why "more context engineering" cannot reach them: the operations require state, and the window is stateless by design.

How the controlled observation maps to the contract

The observation lines up with the literature and the contract directly. The underspecified instruction outperforming an engineered prompt is the prevailing thesis carried to its conclusion: when the layer performs context assembly, the human ceases to be the context-delivery mechanism — the endpoint "context engineering" implies but in-window technique cannot reach. The system specifying construction rather than supplying content is the four operations in combination: trust weighting and reconciliation selecting what was authoritative, spatial expansion surfacing the unrequested positioning document, provenance grounding each element. The complete result from a minimal instruction is the inverse of the degradation and cost findings — a small, structured, verified context rather than a saturated window, avoiding both the Du et al. degradation regime and the re-sent-context expenditure that dominates agentic cost. And the categorical gap between the present-layer and absent-layer runs, model held constant, is the CPU/RAM analogy made literal: the determinant of the output was the substrate feeding working memory, not the processor.

We state the standing more carefully than the demonstration alone would license: the external literature establishes the gap; this observation, and the controlled comparison it belongs to, is our evidence for the fix. We do not conflate the two.

The asymmetry of failure, and a falsifiable prediction

Partition the stack into a working layer — the model and its per-session context management, including the window, agent memory, and compaction — and a context layer as defined above. The two fail asymmetrically.

A weak working layer over a strong context layer fails as underperformance: the system is slow and repetitive but recoverable, and a human can intervene. A weak context layer fails differently: the system produces output that is confidently wrong, internally consistent, and capable of passing automated evaluation, because a capable model smooths coherent reasoning over incoherent inputs. The reasoning competence that defines a strong working layer becomes the mechanism by which contradiction is concealed rather than surfaced.

This yields a prediction we are willing to be held to. As working-layer capability increases, the underperformance failure mode contracts, but the confidently-wrong failure mode does not; it expands, because each increment of model fluency and self-consistency improves the system's capacity to render coherent output from incoherent input. We therefore predict that over the next several model generations, holding context-layer investment fixed, the rate of high-confidence, internally consistent, factually ungrounded errors in enterprise deployments will rise rather than fall, and will rise fastest in the systems with the most capable models. If improving models reduce that error class without a context layer, the argument is wrong. We do not expect that outcome.

Architectural necessity

The layer cannot reside inside the model, and this is forced rather than preferred. A context layer compiled into a model's weights freezes at training time, carries no provenance, cannot reconcile a canonical source against a superseded draft, and must be rebuilt for each successor model. A context layer that is the window inherits every degradation, compaction, and cost result above. The configuration that survives all constraints simultaneously is a layer positioned between organizational data and the model, operating through standard inputs and outputs: no fine-tuning, no parameter updates, no modification to model internals. Such a layer is necessarily model-agnostic, because the functions it performs (persist, verify, reconcile) are precisely the functions no current architecture, transformer or otherwise, performs for itself; the layer must therefore serve a transformer, a state-space model, or a successor architecture without alteration.

The integration surface follows from the same constraint. Because the layer cannot touch model internals, it interposes at the boundary the model already exposes: it sits in front of inference, performs the four contract operations against the organization's knowledge, and returns a structured, verified, minimal context the model reasons over. To the model it looks like an ordinary context payload; to the organization it looks like a service that ingests from the systems of record (Layer 1) and serves whatever consumes inference (Layer 3), with the context layer itself in between. This is the infrastructure pattern that has worked everywhere else in computing: a layer exposing a stable contract upward and absorbing heterogeneity downward, so data sources and models each change without the other noticing. A storage engine does not care which application queries it; the context layer does not care which model reasons over it, and the model need not know the layer exists. For an engineering team this makes adoption an integration problem, not a retraining problem, and that is the only operationally viable version, since no enterprise will fine-tune a frontier model every time a policy document changes.
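The integration pattern can be sketched as a narrow interface between the layer and any callable model. This is an illustration under stated assumptions, not a description of the actual product interface: `ContextLayer`, `StaticLayer`, `answer`, and the prompt format are all hypothetical names, and the model is reduced to a plain function from text to text.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class ContextLayer(Protocol):
    """The stable contract exposed upward: a query in, a context payload out."""

    def resolve(self, query: str) -> str: ...


@dataclass
class StaticLayer:
    """Stand-in layer that returns canned, pre-verified context."""

    facts: dict[str, str]

    def resolve(self, query: str) -> str:
        return self.facts.get(query, "")


def answer(query: str, layer: ContextLayer, model: Callable[[str], str]) -> str:
    """Build an ordinary text payload; the model never sees the layer itself."""
    context = layer.resolve(query)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return model(prompt)
```

Because `answer` depends only on the `resolve` contract and a text-in, text-out callable, the model can be swapped without touching the layer and the layer's internals can change without touching the model.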

The harder reality, which we state rather than paper over, is that the layer's guarantees are only as strong as the ingestion and tiering beneath them. The contract operations presuppose that sources have been tiered, canonical records distinguished from drafts, and relationships modeled. That work is real and ongoing, and it is where most of the genuine difficulty lives. A context layer does not remove the cost of knowing what is true in an organization; it relocates that cost to a place where it can be done once, audited, and reused, instead of being paid implicitly by every human who hand-assembles a prompt. That relocation is the value — not the cost disappearing.

There is a safety corollary, and it is the strongest form of the argument. Treating safety as a guardrail problem constrains a pattern-matching system and relies on the constraints holding. A system that produces authoritative output without a represented model of what it is reasoning over is unsafe by construction. The operative safety mechanism is the context layer — verified relationships, provenance, and conflict detection — the apparatus that lets a system represent what it is reasoning over. The governance-decay result above is the empirical form of this: a constraint that lives only in the working layer can be summarized out of existence and silently violated, whereas a constraint enforced at the context layer is a property of the substrate and does not depend on surviving a compaction pass. On this reading, context is the unaddressed safety layer.

We have built and operated this layer for three years, primarily in regulated environments that do not tolerate confident error. We have not completed the thick layer and do not claim to have. We have built the substrate beneath it, and the controlled comparison accompanying this piece is the evidence of what that substrate changes.

The field has accepted that the context window is RAM. The remaining step is to stop running the institution out of it.


Nucleus AI is a contextual AI research lab building the understanding layer between organizational data and any AI stack.
