Beyond 2x – Keeping Agents in Line without Reading Every Line

The go-to industry-standard approach to production-safe agentic development is the 70/30 rule: let the AI agent do 70% of the work, while the human stays on the hook for the remaining 30% as manual review and fixes. That caps us at around 2x what we could do without AI, since that 30% plus agent logistics (prompting, chatting, reading, etc., before the PR even gets posted) quickly adds up to at least 50% of the original effort.
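
To make that ceiling concrete, here’s a back-of-the-envelope sketch (an Amdahl’s-law-style framing; the numbers are the illustrative ones from above, not measurements):

```python
# If a fraction h of the original (pre-AI) effort still falls on the
# human, the best-case speedup is 1 / h, no matter how fast the agent
# generates code: Amdahl's law with a near-instant agent.

def max_speedup(human_fraction: float) -> float:
    """Upper bound on throughput gain when the agent's time is ~free."""
    return 1.0 / human_fraction

print(max_speedup(0.30))         # 70/30 rule alone: ~3.3x ceiling
print(max_speedup(0.30 + 0.20))  # plus ~20% agent logistics: 2x ceiling
```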

Given that agents can type many times faster than humans and only pause briefly to think and do research, 10x+ is easily achievable just in terms of raw generation speed. And indeed, when using agents to go from prompt to initial prototype it’s possible to reap those gains – especially if you aren’t picky about how well it works and don’t need to build it further.

But what about production code that needs to be not just functional, but meticulously aligned with product specifications? Especially for long-running, heavyweight codebases (“brownfield”), short prompts no longer work. As complexity escalates, throughput diminishes.

To nail down exactly why, let’s start with a short set of qualities we know are non-negotiable, simply by virtue of how much human effort their absence demands and the limits of how fast we humans can work. We can use these as shared qualitative metrics to evaluate solutions.

Asymmetrical Review

The most ubiquitous complaint you’ll hear about the naive approach to heavy agentic workflows is the feeling of drowning in an ocean of code review. If we follow legacy paradigms and have the agents file GitHub pull requests with untrusted code that must be approved by a human, then we’re throttling the whole process to human code-comprehension speed. If the code is novel at all (and if it’s not, why make it?), then no matter how quick a reader you are, this will prevent you from realistically getting past 2x.

Therefore, reading less than every line of code becomes a core constraint in our hypothetical 10x system. Whatever shape reviews take, they must be asymmetrical: the ratio of code generated to code read by the human must be well above 1:1.

Streamlined Prompting

Prompts rich with context can be an effective way to keep agents aligned with intent, but we can’t just go full maximalist and overspecify everything up front, as that puts us right back into being the bottleneck, just at the front end of the batch of work rather than at the review end.

Just as we need new, shorter ways to validate code, we need new, more streamlined ways to set AI agents on the right path from the start. The intent of the new feature, supporting business context, and reconciliation against prior decisions must at some point come from the human, but must be provided in a way that can be captured and reused as much as possible.

Independently Parallelizable Agents

If you’ve done any amount of modern agentic coding, you’ve hopefully experienced the joy of letting an agent run unsupervised for over an hour as it gets actual useful work done. And then you’ve probably also experienced the conundrum of what to do while your agent is tinkering away on its own without needing your involvement. If the tasks are designed to be worked on serially, as you might have done when working by yourself on a project pre-AI, then you can’t start the next agent until the previous agent’s tasks are done, reviewed, revised, and merged. And conversely, if the next agent is already started, then there’s not much more to do with the previous agent’s work.

Another issue with serially arranged agentic work is that agents are completely fine with working 24/7: they don’t need sleep, but you do. If you want to have a life away from the computer at all, you can’t be babysitting agentic workloads every 30-60 minutes from when you wake up to when you go to sleep.

The solution is an aggressive lean into parallelization, which is much easier said than done. Two agents working on two tasks in parallel means merge conflicts to solve, in addition to twice as many prompts to prepare and twice as much code to validate. This stretches our per-task time budget from tight to razor thin.

Pain Points

The naive approach would be to simply kick agents off with sparsely worded prompts, delegate verification of agent results to verifier agents (or rubber-stamp the results entirely), parallelize whenever possible, and just rush through the merge debt. There are a few specific issues that tend to crop up after some number of rounds of this, even with the best models.

Unvetted Agent Decisions

Writing code is a constant series of decisions, most of which are invisible from just looking at the result. Most of those decisions are small scale: their direct impact doesn’t go beyond a line, a function, a file. However, many are high impact, like architecture, external library choice, and testing strategy. Agents are more than willing to decide all these things for you, and a lot of the time that’s a huge boon.

However, especially in the case of sparsely written prompts, these decisions can be completely contrary to your intent for the feature. Also, because prior interactions are not automatically remembered, the agent might make decisions which completely compromise the trajectory of earlier work.

Agents often debrief a turn with a list of decisions made, which is helpful, but deconstructing the minutiae of dozens of decisions and tying them back to what you did or didn’t intend from your sparse prompt is a very time-consuming process. So is listing out all the wrong decisions, pairing them with corrective nudges, and waiting for the second-round results as the agent goes back and attempts to surgically redo key parts of its work.

Wrong Decisions Tend to Get Compounded Rather than Rectified

What makes these divergent agentic decisions especially insidious is less their immediate impact and more how they invisibly change the trajectory of development going forward, especially when paired with sparse prompts. When an agent spins up with a blank slate and starts investigating how to fulfill its task, if it can’t deduce the necessary context from the prompt then it will backfill it from what it can glean about prior work. The best signal for this is usually the current state of the code, reverse-engineered into what the intent likely was.

There’s no straightforward way for the agent to deduce whether a line of code and its associated behavior came from a human, an LLM tab completion, or a blind guess from a prior agent. Version history can give some insight, but mining it is neither straightforward nor token-efficient. The main signal the agent has is that the code did make it into a merged commit, so in the absence of something more substantial, that arbitrary line of code becomes law.

Divergent decisions can infest a codebase like a virus. A single leftover comment, line in a doc, or errant test docstring can cause the bad behavior to crop up again and get reinforced by a well-intentioned AI agent trying to reconcile things as best it can given the reference material available. Sparse prompts can easily get overpowered by conflicting stale signals from the codebase.

AI-Generated Tests Tend to Validate Implementation Rather than Intent

The traditional approach to ensuring code matches and sticks to intent is automated testing: encode the intent as a test, then get a pass/fail signal indicating whether that contract has been breached. Of course, this fully depends on the test being an effective measure of the human intent, and not all intentions map well onto automated checks.

Even when the intent is easily testable, though, AI has a documented bias towards writing tests that simply assert arbitrary specifics of the implementation rather than capturing the intent behind it. Guiding the agent to write expressive, behavior-oriented tests tightly aligned with human intent risks crowding our plate with an additional task, pushing us further from our goal of sparse prompts and minimized feedback loops.
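
To make the distinction concrete, here’s a minimal sketch (the `discount` function and its stated intent are hypothetical, not from any real codebase):

```python
# Stand-in for a function an agent was asked to write against the
# intent: "large orders get some discount, and no one ever pays more
# than list price."
def discount(total: float) -> float:
    return total * 0.9 if total >= 100 else total

# Implementation-coupled (the pattern AI tends to produce): asserts the
# arbitrary constant 0.9, so refactoring to, say, tiered discounts
# breaks the test even when the product intent is still satisfied.
def test_discount_implementation():
    assert discount(200) == 200 * 0.9

# Behavior-oriented: encodes the intent itself, so any correct
# refactor keeps passing.
def test_discount_intent():
    assert discount(200) < 200   # large orders get *some* discount
    assert discount(50) <= 50    # never above list price
```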

Because tests can be as much a source of issues as a solution, and because we can’t afford the human review bandwidth of manually verifying and critiquing test cases, we clearly need a different mechanism as our source of truth for intent.

Spec Driven Development?

If we focus on a frozen set of intents at one point in time, the Spec Driven Development movement appears to compellingly solve, or at least assist with, all of these issues.

The specification layer provides an elevated source of truth optimized to be read and edited by a human. As an artifact that persists between tasks, it obviates the need to respecify pre-existing aspects of a project when we spin off agents to build new features. Because we can say the specs take priority, we don’t need to worry about code remnants of old product states leading to conceptual conflicts. And since our specifications are simply natural language rather than automated tests, we can edit them manually when we need to while staying nimble.

However, we learn how to build the products we’re building as we build them, and our plans adjust accordingly. What made sense to specify at one point for one feature might not make sense as a foundation for future work. Our history of specifications can start as a series of layered platforms to build on but end up as a cage that prevents us from specifying features that go in new directions, even when the scale of code changes required is well within the capabilities of the AI agent.

We need to preserve old specifications so we still have access to the intent behind the features they describe, since most of those features continue to exist. However, the number of conflicts between specifications will increase combinatorially as more accumulate. Eventually some form of “flattening” old specifications together becomes necessary, but this is very low-value work: the older a specification is, the less relevant it typically is to current work, and the more faded human memory of it will be. If we decline to flatten, the specification layer no longer serves as a clear source of truth, and we’re back to the same problems we tried to solve, with yet another layer of complexity to keep up with.

A Normalized, Hierarchical, Versioned Knowledge Store

“Normalized” is the main headline here. Whenever new specifications get implemented, there needs to be a proactive reconciliation into a separate, incrementally maintained knowledge store. The reconciliation action (or some verification step of it, before accepting it as authoritative) then becomes the slot for the human to get involved, and hypothetically their involvement may be as small as confirming a normalized spec diff (or, more likely, an AI summary of it), which ought to be the most distilled possible form of the conceptual change: exactly what we need for minimizing bottlenecks.

“Hierarchical” is a bit of an educated guess, but given how routinely specifications need to be elaborated and expanded on in order to clarify ambiguities, a hierarchy just seems most natural: a simple, intuitive structure that trivially supports injecting new statements beneath existing statements without needing to otherwise restructure.
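
A minimal sketch of that structure (the names are hypothetical, just to show that clarification is pure insertion):

```python
from dataclasses import dataclass, field

@dataclass
class SpecNode:
    """One statement in the spec; clarifications hang beneath it."""
    statement: str
    children: list["SpecNode"] = field(default_factory=list)

    def clarify(self, detail: str) -> "SpecNode":
        # Injecting a more specific statement never restructures
        # anything that already exists.
        child = SpecNode(detail)
        self.children.append(child)
        return child

root = SpecNode("Users can export their data")
fmt = root.clarify("Exports are CSV by default")
fmt.clarify("Timestamps are ISO 8601, in UTC")
```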

“Versioned” is the most exciting part to me. When the whole knowledge store is versioned, we can fearlessly branch out into hypothetical variations that can be fully referenced by version for purposes of planning, testing, and validation, without disturbing AI agents already in motion working relative to their own local versions. We can conversationally reference multiple paradigm shifts by SHA (or whatever form of version identifier) and let the agent do the heavy lifting of looking them up, perusing diffs, consulting logs, etc., to learn what it needs in order to understand the concepts presented in your prompt, without needing to reverse-engineer code in a vacuum.
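
Here’s a toy sketch of the versioning idea (my own content-addressed scheme for illustration, not any real tool’s API):

```python
import hashlib
import json

# Every snapshot of the knowledge store is serialized and content-
# addressed, so a short hash can anchor a prompt or an agent task the
# same way a git SHA anchors code.
versions: dict[str, dict] = {}

def commit(spec: dict) -> str:
    blob = json.dumps(spec, sort_keys=True).encode()
    sha = hashlib.sha1(blob).hexdigest()[:8]
    versions[sha] = spec
    return sha

v1 = commit({"export": "CSV by default"})
v2 = commit({"export": "CSV or JSON, user-selectable"})
# An agent can now be told "implement the delta from v1 to v2" and look
# both versions up itself instead of reverse-engineering the code.
print(v1, v2)
```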

Versions as Anchors

Putting these all together, we have a knowledge store composed of a series of versions, each serving as a conceptual anchor to guide code changes either towards or away from a particular implied code state. When a new version is too ambiguous to fulfill, the agent can come back with one or more new provisional versions of the spec, asking which variation is closest to the human’s intent; the chosen variation then gets promoted as the new target.

Alternatively, maybe the arbitration is straightforward enough that the agent feels comfortable making it without bothering the human for acceptance until it comes time to request merge approval, presenting the implications of those decisions as provisional changes to the spec. If accepted, the new version becomes an anchor for future work, so changes to the spec should be considered mandatory for human review.
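
Continuing the toy scheme from above, the promotion loop might look like this (all names hypothetical; `choose` stands in for the human’s selection or an automated arbiter):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Proposal:
    sha: str      # a provisional knowledge-store version
    summary: str  # the distilled spec diff the human actually reviews

def promote(proposals: list[Proposal],
            choose: Callable[[list[Proposal]], Proposal]) -> str:
    """Present provisional spec versions; return the promoted anchor."""
    return choose(proposals).sha

anchor = promote(
    [Proposal("a1b2c3d4", "Exports stay CSV-only"),
     Proposal("e5f6a7b8", "Add JSON export alongside CSV")],
    choose=lambda ps: ps[1],  # stand-in for the human picking option 2
)
print(anchor)  # future agent work is prompted relative to this version
```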

Since by this point we’ve already carefully reviewed the core conceptual changes, it starts to become less dangerous, depending on the context, to review sparse summaries of code diffs rather than demanding the human read every line.

The Solution*

Through discussing these problems with my peers, I’ve realized that the issues are incredibly nuanced and everyone has a naturally different way of compartmentalizing the dynamics at play. Because of this, I felt that clearly defining the problems in a solution-agnostic way was the most valuable contribution, as I suspect we have many more tools to build in order to fully realize these ideals.

That being said, here are some resources which may provide some inspiration and/or progress along the path:

OpenSpec – github.com/Fission-AI/OpenSpec

  • Oriented around a living specs folder which holds incrementally reconciled knowledge from diff-oriented entities stored in a `changes/` subfolder
  • Cursor-friendly (plus support for others) commands like `propose`, `apply`, `archive` etc which likely could be composed into higher level workflows
  • Takes a Beads-style “reuse the project repo for versioning” approach
    • Potentially complicates exchanges like “did you mean version A, B, or C?”, since the agent would need to actually commit, and perhaps push, these versions mid-turn to be able to reference them by version

Augment Intent – docs.augmentcode.com/intent/overview

  • Very conceptually aligned but highly opinionated
  • Handles both spec alignment and agent orchestration
  • No Cursor support, requires using Augment’s `auggie` CLI tool
  • Apple-only 👎

Giterloper – github.com/jcwilk/giterloper

  • My own exploratory project: an MCP server wrapped around a simple agent for managing and editing git clones of a remote, knowledge-only GitHub repository
  • Not stable or feature complete

John Berryman – arcturus-labs.com/blog/2025/10/17/why-spec-driven-development-breaks-at-scale-and-how-to-fix-it/

  • A kindred spirit with very similar conclusions

I’m interested in discussing this paradigm; please contact me if you are too.
