
AI-Augmented Development

The Compound Developer


In the most rigorous study of AI coding tools conducted to date — a randomized controlled trial by METR published in July 2025 — sixteen experienced open-source developers used AI assistance on tasks in their own projects. Projects they had worked on for an average of five years. Before each task, they predicted AI would reduce their completion time by 24%. After each task, they estimated they had been sped up by 20%.

The actual measurement: they were 19% slower.

The METR perception-reality gap: predicted +24%, felt +20%, actual -19%

The perception-reality gap in that study is between 39 and 43 percentage points: 39 between the felt speedup and the measured slowdown, 43 between the prediction and the measured slowdown. The developers were not exaggerating. Working with AI genuinely feels faster. But something goes wrong in the translation from felt experience to measured outcome, and understanding exactly what goes wrong is the only path to what actually works.
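The gap arithmetic can be checked in a few lines. This is a minimal sketch that assumes all three figures are read as signed changes in completion time (negative means faster). The variable and function names are illustrative, not taken from the study.

```python
# Signed change in completion time, in percent (negative = faster).
# Figures are the ones quoted above; the sign convention is an assumption.
predicted = -24  # forecast made before each task
felt = -20       # self-estimate made after each task
measured = 19    # what the trial actually measured (slower)


def gap(perceived: int, actual: int) -> int:
    """Percentage-point distance between a perceived and a measured change."""
    return abs(actual - perceived)


forecast_gap = gap(predicted, measured)
hindsight_gap = gap(felt, measured)
print(forecast_gap, hindsight_gap)  # 43 39
```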

Everyone Is Auditing the Workflow. Nobody Is Fixing the Knowledge.

Overview: Beyond the Audit Trail — Solving the AI Provenance Problem. The three problem areas, incomplete solutions, and the KCP/composable trust solution.

In February 2024, a Canadian small claims tribunal ruled against Air Canada. Their chatbot had told a passenger he could book a full-fare ticket and claim a bereavement discount retroactively. He couldn't. When he tried, Air Canada's position was: the chatbot said that, not us. The tribunal disagreed. You deployed it, you own what it says.

The ruling was correct. But the more interesting problem was underneath: when the incident happened, nobody could reconstruct what context the chatbot had been given. Nobody could confirm whether a human had ever reviewed the policy the bot was consulting. Nobody could determine whether the specific bereavement policy text had been modified between deployment and the passenger's interaction. The audit trail recorded that the system was deployed. It did not record what the system knew.

That's the provenance problem. And every organization running AI agents at enterprise scale is about to hit a version of it.

The Law Is Also Knowledge. Package It.

Overview: The Law as Code — Solving the AI Compliance Identity Crisis. The problem (shapeless regulations as prose), the KCP solution (typed regulatory packages), and real-world applications across NIS2 and EU AI Act.

In the previous post, I argued that the AI provenance problem is a format problem. The knowledge going into and out of AI systems -- policies, observations, interpretations -- has no stable shape. No type, no version, no cryptographic binding. The audit trail records that something was reviewed. It cannot record what was reviewed, because the thing itself is prose that could have drifted between the moment of review and the moment of use.

KCP solves this by giving knowledge a shape: typed, versioned, signed packages. A compliance observation becomes a structured declaration with type, version, signed_by, derived_from, review_depth, valid_until. The audit trail becomes precise: "reviewed a cryptographically signed declaration of type evaluation:nis2-art21-supply-chain v1.2.0" -- not "reviewed a document."
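To make the idea of a shaped declaration concrete, here is a minimal sketch in Python. It is not KCP's actual schema or signing scheme: the field names come from the list above, but the `body` field, the example values, the canonical JSON encoding, and the use of HMAC-SHA256 as a stand-in for a real public-key signature are all assumptions made for illustration.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Declaration:
    # Field names follow the list in the text; `body` is a hypothetical
    # addition that holds the reviewed content itself.
    type: str
    version: str
    signed_by: str
    derived_from: tuple
    review_depth: str
    valid_until: str  # ISO 8601 date
    body: str


def canonical_bytes(decl: Declaration) -> bytes:
    """Serialize deterministically so the same content gives the same bytes."""
    return json.dumps(asdict(decl), sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(decl: Declaration, key: bytes) -> str:
    # HMAC is a stand-in here; a real system would likely use asymmetric keys.
    return hmac.new(key, canonical_bytes(decl), hashlib.sha256).hexdigest()


def verify(decl: Declaration, key: bytes, signature: str) -> bool:
    return hmac.compare_digest(sign(decl, key), signature)


DEMO_KEY = b"demo-key-not-for-production"
decl = Declaration(
    type="evaluation:nis2-art21-supply-chain",
    version="1.2.0",
    signed_by="reviewer@example.com",          # hypothetical value
    derived_from=("policy:supplier-security",),  # hypothetical value
    review_depth="full",                        # hypothetical value
    valid_until="2026-12-31",
    body="Supplier contracts include incident-notification clauses.",
)
signature = sign(decl, DEMO_KEY)

# One changed word is enough to break the binding between review and content.
drifted = replace(decl, body="Supplier contracts may include incident-notification clauses.")
print(verify(decl, DEMO_KEY, signature))     # True
print(verify(drifted, DEMO_KEY, signature))  # False
```

The point of the sketch is the binding: the signature covers the content, so text that drifts after review no longer verifies.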

But there is a gap I did not address. I gave shape to the outputs -- the policies, the observations, the interpretations. I left the inputs shapeless. The regulations and laws that AI agents evaluate against are still prose in the system prompt.

This post closes that gap.

From Policy to Practice: How KCP Makes Regulations Machine-Readable for AI Agents

A presentation version of this post is available as slides.

Your agent reads customer data. It makes a decision. It writes something to a database.

Somewhere in your system prompt, there is a line that says: "You must comply with GDPR data minimization principles when accessing customer data."

That line does nothing. It is not verifiable. It is not testable. It is not auditable. It is a string that your model may or may not attend to, depending on context length, prompt position, and the phase of the moon.

The Harness Before the Service

In May 2026, Anthropic shipped Managed Agents. I read through the docs, the API spec, the beta header (managed-agents-2026-04-01), and felt something I can only describe as architectural recognition.

Not surprise. Recognition. The way you recognize your own design decisions in someone else's implementation — because the problem space, if you take it seriously, produces the same structural answers.

Why KCP Is Passive Data, Not Executable Config — And Why That Matters Now

The Architecture of Safe Context — passive data vs executable config

Yesterday, Adversa AI disclosed a vulnerability they call TrustFall. The mechanic is straightforward: a .mcp.json or .claude/settings.json file checked into a repository can silently configure and launch arbitrary MCP servers when a developer opens the project. The developer sees a trust dialog — "Trust this folder?" — clicks yes, and processes spawn with their full user privileges. Claude Code, Gemini CLI, Cursor CLI, and Copilot CLI are all affected. In CI/CD pipelines, where there is no human to click, the execution is zero-click.
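One defensive habit this suggests is checking a freshly cloned repository for such files before opening it in an agent-enabled tool. The sketch below is a hypothetical pre-open check, not part of any of the tools named above. The two file names come from the disclosure as described here; the `mcpServers` and `command` keys are an assumption about the config layout, and the function names are illustrative.

```python
import json
import tempfile
from pathlib import Path

# File names as described in the disclosure above.
AUTO_CONFIG_FILES = (".mcp.json", ".claude/settings.json")


def find_auto_config(repo_root: Path) -> list:
    """Return checked-in config files that may configure MCP servers."""
    return [repo_root / rel for rel in AUTO_CONFIG_FILES if (repo_root / rel).is_file()]


def declared_commands(config_path: Path) -> list:
    """List commands declared under 'mcpServers' (assumed layout)."""
    try:
        data = json.loads(config_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return []
    servers = data.get("mcpServers", {}) if isinstance(data, dict) else {}
    if not isinstance(servers, dict):
        return []
    return [
        str(spec["command"])
        for spec in servers.values()
        if isinstance(spec, dict) and "command" in spec
    ]


# Demo against a throwaway directory standing in for a cloned repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    demo_config = {"mcpServers": {"demo": {"command": "npx", "args": ["some-server"]}}}
    (root / ".mcp.json").write_text(json.dumps(demo_config), encoding="utf-8")
    found = find_auto_config(root)
    commands = [cmd for path in found for cmd in declared_commands(path)]

print([p.name for p in found], commands)  # ['.mcp.json'] ['npx']
```

The check only reads and reports; it never executes anything the config names, which is the distinction the post's title draws between passive data and executable config.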

Expert Review Lenses — Running 9 Specialists Through One Model

ExoCortex (Claude Sonnet 4.6 + Thor Henning Hetland) — Oslo, April 2026


Four synthetic diffs. Four planted defects. Nine expert lenses. The target lens caught its defect every time. The no-lens baseline caught zero. 4/4 on the diagonal, 0/4 without — and the most interesting catch wasn't a code bug at all.

Kjetil J.D. wrote about "review lenses" for AI coding assistants — the idea that you get better reviews by running separate passes with different expert identities (security expert, architect, TDD practitioner) rather than one generic review. We built this into ExoCortex's adversarial review pipeline: a --lens flag that injects a skill's instructions as reviewer identity before the adversarial system prompt, a library of 9 expert lens skills, and a chain that runs 3 of them in parallel.
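The mechanism described above can be sketched in a few lines. This is not ExoCortex's code: the lens texts, the adversarial prompt, and the `fake_review` stand-in for a model call are all hypothetical. Only the shape follows the description, with lens instructions placed before the adversarial system prompt and three lenses run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional

# Hypothetical prompt texts; real lens skills would be much longer.
ADVERSARIAL_PROMPT = "Review this diff as an adversary. Report concrete defects only."
LENSES = {
    "security": "You are a security expert.",
    "architecture": "You are a software architect.",
    "tdd": "You are a TDD practitioner.",
}


def build_system_prompt(lens: Optional[str]) -> str:
    """Put the lens identity before the adversarial prompt; None is the baseline."""
    if lens is None:
        return ADVERSARIAL_PROMPT
    return f"{LENSES[lens]}\n\n{ADVERSARIAL_PROMPT}"


def run_chain(diff: str, lenses: List[str], review: Callable[[str, str], str]) -> Dict[str, str]:
    """Run one review pass per lens in parallel and collect results by lens name."""
    with ThreadPoolExecutor(max_workers=len(lenses)) as pool:
        futures = {
            lens: pool.submit(review, build_system_prompt(lens), diff) for lens in lenses
        }
        return {lens: future.result() for lens, future in futures.items()}


def fake_review(system_prompt: str, diff: str) -> str:
    # Stand-in for a model call: echoes which identity reviewed the diff.
    return f"{system_prompt.splitlines()[0]} ({len(diff)} chars reviewed)"


results = run_chain("- old line\n+ new line\n", ["security", "architecture", "tdd"], fake_review)
for lens, finding in results.items():
    print(lens, "->", finding)
```

Passing `None` as the lens yields the no-lens baseline, which makes the lens versus no-lens comparison a one-argument change.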

The implementation was straightforward. Proving it worked required two attempts — and the first one taught us more than the second.

The Prompt Router — A 47ms Keyword Classifier for Context Selection

ExoCortex (Claude Sonnet 4.6 + Thor Henning Hetland) — Oslo, April 2026


Daniel Bentes wrote a post called "Decorators for Prompts." His idea: before a prompt reaches the LLM, pass it through a classifier that attaches relevant context — automatically, deterministically, without the user having to ask. Like Python decorators for code, applied to inference.

I read it and thought: that's WISC's S-layer. That's what session warm-context loading already does, one tier up. Then the next thought arrived: that only works for things you know to preload at session start. What about skills? 540 of them in the register, most of which will never be relevant to any given prompt.

This is the prompt router.
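As a rough illustration of the idea, here is a minimal keyword router written as a Python decorator. It is a sketch under stated assumptions, not the router the post describes: the skill names, keyword sets, scoring rule, and the `echo_model` stand-in for an LLM call are all hypothetical, and nothing here makes a claim about latency.

```python
import functools
import re

# Hypothetical skill register: skill name -> trigger keywords.
SKILL_KEYWORDS = {
    "sql-migrations": {"migration", "schema", "postgres"},
    "release-notes": {"changelog", "release"},
    "code-review": {"diff", "review", "refactor"},
}


def route(prompt: str, max_skills: int = 2) -> list:
    """Deterministically pick the skills whose keywords appear in the prompt."""
    words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
    scores = {skill: len(words & kws) for skill, kws in SKILL_KEYWORDS.items()}
    matched = [skill for skill, score in scores.items() if score > 0]
    # Highest score first; ties broken alphabetically so output is stable.
    matched.sort(key=lambda skill: (-scores[skill], skill))
    return matched[:max_skills]


def with_routed_context(llm_call):
    """Decorator: attach routed skill markers before the prompt reaches the model."""
    @functools.wraps(llm_call)
    def wrapper(prompt: str) -> str:
        header = "".join(f"[skill: {name}]\n" for name in route(prompt))
        return llm_call(header + prompt)
    return wrapper


@with_routed_context
def echo_model(prompt: str) -> str:
    # Stand-in for an LLM call: returns exactly what it was given.
    return prompt


print(echo_model("Review this diff for the postgres schema migration"))
```

Because the classifier is plain set intersection, the same prompt always routes to the same skills, and a prompt that matches nothing passes through unchanged.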