Why most enterprise AI coding pilots underperform (Hint: It’s not the model)

Gen AI in software engineering has moved well beyond autocomplete. The emerging frontier is agentic coding: AI systems capable of planning changes, executing them across multiple steps and iterating based on feedback. Yet despite the excitement around “AI agents that code,” most enterprise deployments underperform. The limiting factor is no longer the model. It’s context: The structure, history and intent surrounding the code being changed. In other words, enterprises are now facing a systems design problem: They have not yet engineered the environment these agents operate in.

The shift from assistance to agency

The past year has seen a rapid evolution from assistive coding tools to agentic workflows. Research has begun to formalize what agentic behavior means in practice: The ability to reason across design, testing, execution and validation rather than generate isolated snippets. Work such as dynamic action re-sampling shows that allowing agents to branch, reconsider and revise their own decisions significantly improves outcomes in large, interdependent codebases. At the platform level, providers like GitHub are now building dedicated agent orchestration environments, such as Copilot Agent and Agent HQ, to support multi-agent collaboration inside real enterprise pipelines.

But early field results tell a cautionary story. When organizations introduce agentic tools without addressing workflow and environment, productivity can decline. A randomized control study this year showed that developers who used AI assistance in unchanged workflows completed tasks more slowly, largely due to verification, rework and confusion around intent. The lesson is straightforward: Autonomy without orchestration rarely yields efficiency.

Why context engineering is the real unlock

In every unsuccessful deployment I’ve observed, the failure stemmed from context. When agents lack a structured understanding of a codebase, specifically its relevant modules, dependency graph, test harness, architectural conventions and change history. They often generate output that appears correct but is disconnected from reality. Too much information overwhelms the agent; too little forces it to guess. The goal is not to feed the model more tokens. The goal is to determine what should be visible to the agent, when and in what form.

The teams seeing meaningful gains treat context as an engineering surface. They create tooling to snapshot, compact and version the agent’s working memory: What is persisted across turns, what is discarded, what is summarized and what is linked instead of inlined. They design deliberation steps rather than prompting sessions. They make the specification a first-class artifact, something reviewable, testable and owned, not a transient chat history. This shift aligns with a broader trend some researchers describe as “specs becoming the new source of truth.”

Workflow must change alongside tooling

But context alone isn’t enough. Enterprises must re-architect the workflows around these agents. As McKinsey’s 2025 report “One Year of Agentic AI” noted, productivity gains arise not from layering AI onto existing processes but from rethinking the process itself. When teams simply drop an agent into an unaltered workflow, they invite friction: Engineers spend more time verifying AI-written code than they would have spent writing it themselves. The agents can only amplify what’s already structured: Well-tested, modular codebases with clear ownership and documentation. Without those foundations, autonomy becomes chaos.

Security and governance, too, demand a shift in mindset. AI-generated code introduces new forms of risk: Unvetted dependencies, subtle license violations and undocumented modules that escape peer review. Mature teams are beginning to integrate agentic activity directly into their CI/CD pipelines, treating agents as autonomous contributors whose work must pass the same static analysis, audit logging and approval gates as any human developer. GitHub’s own documentation highlights this trajectory, positioning Copilot Agents not as replacements for engineers but as orchestrated participants in secure, reviewable workflows. The goal isn’t to let an AI “write everything,” but to ensure that when it acts, it does so inside defined guardrails.

What enterprise decision-makers should focus on now

For technical leaders, the path forward starts with readiness rather than hype. Monoliths with sparse tests rarely yield net gains; agents thrive where tests are authoritative and can drive iterative refinement. This is exactly the loop Anthropic calls out for coding agents. Pilots in tightly scoped domains (test generation, legacy modernization, isolated refactors); treat each deployment as an experiment with explicit metrics (defect escape rate, PR cycle time, change failure rate, security findings burned down). As your usage grows, treat agents as data infrastructure: Every plan, context snapshot, action log and test run is data that composes into a searchable memory of engineering intent, and a durable competitive advantage.

Under the hood, agentic coding is less a tooling problem than a data problem. Every context snapshot, test iteration and code revision becomes a form of structured data that must be stored, indexed and reused. As these agents proliferate, enterprises will find themselves managing an entirely new data layer: One that captures not just what was built, but how it was reasoned about. This shift turns engineering logs into a knowledge graph of intent, decision-making and validation. In time, the organizations that can search and replay this contextual memory will outpace those who still treat code as static text.

The coming year will likely determine whether agentic coding becomes a cornerstone of enterprise development or another inflated promise. The difference will hinge on context engineering: How intelligently teams design the informational substrate their agents rely on. The winners will be those who see autonomy not as magic, but as an extension of disciplined systems design:Clear workflows, measurable feedback, and rigorous governance.

Bottom line

Platforms are converging on orchestration and guardrails, and research keeps improving context control at inference time. The winners over the next 12 to 24 months won’t be the teams with the flashiest model; they’ll be the ones that engineer context as an asset and treat workflow as the product. Do that, and autonomy compounds. Skip it, and the review queue does.

Context + agent = leverage. Skip the first half, and the rest collapses.

Dhyey Mavani is accelerating generative AI at LinkedIn.

Read more from our guest writers. Or, consider submitting a post of your own! See our guidelines here.

Twitter Facebook Pinterest Linkedin