Skill-Driven Development

Explorative Development

Practitioner notes on engineering as a sequence of experiments — and on who does what in the loop.

Before we wrote the production code for a recent platform feature, we ran a fictional organization through two years of using it.

Twenty-four simulated months of compliance life: onboarding, supplier churn, audits, incidents, people leaving with their knowledge. The simulation produced fifteen architectural findings — wrong assumptions and missing pieces, discovered while every fix was still cheap and no customer existed yet. Sibling simulations took the total to twenty-five. A design meeting on the same material would have produced opinions.

That run is the clearest recent example of how I've worked for years, and of what I've started calling the approach out loud: explorative development. An idea becomes a hypothesis. The hypothesis becomes the cheapest implementation or simulation that could prove it wrong. The result gets verified. What survives is kept — and what was learned gets encoded, either way.

It is not a new method. It's the scientific method wearing a hoodie. Two things are new: the price list, and the fact that I no longer run the loop alone.

Organized Truths

Practitioner notes on the verification you only do once.

In the previous post I described catching an agent claiming a parser was "fully RFC compliant." I caught it by opening the RFC: four minutes of reading, set against a parser that implemented none of the wildcard support the spec requires.

The tips in that post were about catching such claims. This post is about a better question that took me longer to ask:

Why did the agent never open the RFC?

Not because it couldn't read it. Because the RFC wasn't there. The agent had the parser in its context and the spec in its vibes — a compressed, lossy impression from training data. Asked to compare code against a standard, it compared code against its memory of the genre of that standard. Of course it produced an adjective.

You can audit that failure forever. Or you can change what the agent reasons from.

False Alarms and False Assurances

Practitioner notes on verifying what your agents tell you.

This week an agent told me, confidently, that an API endpoint had no authentication.

It did. The router was mounted twelve lines after the auth middleware. The agent had read the route file — clean, self-contained, no auth code in sight — and reported what it saw. What it saw was true. What it concluded was false.
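A minimal sketch of why the route file alone could not answer the question. Everything here is hypothetical: `App`, `Router`, `require_token` and the `/api/summary` path are toy stand-ins written for illustration, not the framework or code from the incident. The only assumption carried over is the ordering rule described above: a router inherits whatever middleware was registered before it was mounted.

```python
from typing import Callable, Dict, List, Optional, Tuple

Handler = Callable[[dict], str]
# A middleware returns a rejection response, or None to let the request continue.
Middleware = Callable[[dict], Optional[str]]


class Router:
    """Toy router: this is all the 'route file' contains. No auth in sight."""

    def __init__(self) -> None:
        self.routes: Dict[str, Handler] = {}

    def get(self, path: str, handler: Handler) -> None:
        self.routes[path] = handler


class App:
    """Toy app: middleware registered before a mount guards the mounted routes."""

    def __init__(self) -> None:
        self.middleware: List[Middleware] = []
        self.routes: Dict[str, Tuple[Handler, List[Middleware]]] = {}

    def use(self, mw: Middleware) -> None:
        self.middleware.append(mw)

    def mount(self, prefix: str, router: Router) -> None:
        guards = list(self.middleware)  # snapshot of what came before the mount
        for path, handler in router.routes.items():
            self.routes[prefix + path] = (handler, guards)

    def handle(self, path: str, request: dict) -> str:
        handler, guards = self.routes[path]
        for guard in guards:
            rejection = guard(request)
            if rejection is not None:
                return rejection
        return handler(request)

    def guards_for(self, path: str) -> List[str]:
        """Answer 'what protects this endpoint?' from the assembled app."""
        return [g.__name__ for g in self.routes[path][1]]


def require_token(request: dict) -> Optional[str]:
    # Hypothetical auth check: any request without a token is rejected.
    return None if request.get("token") else "401 unauthorized"


# --- route file: clean, self-contained, no auth code ---
reports = Router()
reports.get("/summary", lambda request: "200 summary")

# --- app file: auth is registered first, the router is mounted later ---
app = App()
app.use(require_token)
app.mount("/api", reports)

print(app.guards_for("/api/summary"))
print(app.handle("/api/summary", {}))
print(app.handle("/api/summary", {"token": "abc"}))
```

Reading `reports` in isolation supports the claim "no authentication". Asking the assembled `app` does not. The verification step is the same in a real codebase: check the claim against the composed application, not against the file the claim was derived from.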

The same afternoon, two more claims from the same research run didn't survive contact with the source: a parser described as "fully RFC compliant" (it lacked the wildcard support the RFC requires), and a plugin described as "active" (it was active only because of an import side effect in an unrelated legacy file — a load-order accident no test asserted).

Three wrong claims, one afternoon, inside an otherwise excellent piece of agent research that compressed days of code archaeology into hours. This is not a complaint about agents. It is a job description for the human.

The Compound Developer

A developer at their desk, knowledge network nodes glowing through the dark

In the most rigorous study of AI coding tools conducted to date — a randomized controlled trial by METR published in July 2025 — sixteen experienced open-source developers used AI assistance on tasks in their own projects. Projects they had worked on for an average of five years. Before each task, they predicted AI would reduce their completion time by 24%. After each task, they estimated they had been sped up by 20%.

The actual measurement: they were 19% slower.

The METR perception-reality gap: predicted +24%, felt +20%, actual -19%

The perception-reality gap in that study is between 39 and 43 percentage points: 20 + 19 measured against what developers felt, 24 + 19 measured against what they predicted. The developers were not exaggerating. Working with AI genuinely feels faster. But something in the translation from felt experience to measured outcome goes wrong, and understanding exactly what goes wrong is the only path to what actually works.
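The gap can be recomputed from the three figures quoted above. This is a small sketch that treats all three as signed percentages on the same scale (positive means faster), which is how the figure caption presents them; the variable names are illustrative only.

```python
# Figures as quoted in this post: positive = faster, negative = slower.
predicted_speedup = 24   # forecast before the tasks
felt_speedup = 20        # estimate after the tasks
measured_speedup = -19   # what the trial actually measured

# Gap in percentage points between each belief and the measurement.
gap_vs_prediction = predicted_speedup - measured_speedup
gap_vs_feeling = felt_speedup - measured_speedup

print(f"prediction vs. measurement: {gap_vs_prediction} points")
print(f"feeling vs. measurement:    {gap_vs_feeling} points")
```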