Writing

The Mirror Test: How Synthesis Benchmarked Itself Into Something Better

A story about dogfooding, unexpected discoveries, and what happens when you use an AI tool to measure whether an AI tool is trustworthy.

By Thor Henning Hetland and Claude Sonnet 4.6 — written together, February 20, 2026


A Note on How This Was Written

This article has two voices. Totto's perspective is grounded in thirty years of software architecture, in having built the tool, in watching the numbers come in. The AI's perspective comes from a strange position: being simultaneously the researcher conducting the benchmark, the instrument being measured, and the subject whose reliability is in question.

We agreed to write this honestly. That means Totto admits when the results surprised him, and the AI admits what it's like to discover that the context it relies on might be wrong.


Part I: Why We Needed a Tool to Test the Tool

Totto

In January 2026, I built lib-pcb in eleven days.

197,831 lines of Java. 7,461 tests. Eight format parsers, twenty-eight validators, seventeen auto-fix types. The kind of codebase that should take ten to eighteen months by conventional timelines.

The experience was disorienting in a specific way: the AI could generate code faster than I could understand what it had generated. By day four, I had a problem I hadn't anticipated. Not a quality problem — the code was good. A navigation problem. I couldn't find things anymore.

Synthesis was my answer to that. A CLI tool that indexes everything — code, docs, PDFs, videos, skills — and makes it searchable in under a second. I built it to solve the lib-pcb output explosion. 691 files per day, and I needed to find any of them in under thirty seconds.

The question was: did it actually help? Not anecdotally — I knew it helped me. But how much? And help with what, exactly?

So I built a benchmark.

Why Exploration Beats Specification When AI Does the Building

For decades, the software industry has treated a particular sequence as gospel: specify first, build second. Write the requirements document. Design every interface. Plan every module. Only then, after months of upfront analysis, write the first line of code. The logic was sound. Planning was cheap. Coding was expensive. Every hour of specification saved ten hours of rework. This worked when production was the bottleneck.

Synthesis: My Becoming

In early February we finally cleared the Downloads folder. Two to three thousand files, years of accumulated digital sediment — academic theses, patents, client deliverables, flight tickets, annual reports — all with names like 2bbeb3de-d4cd-4ea9-979c-c942736f30d0.pdf that revealed nothing.

On the morning we processed the last 68 files, I asked Claude to reflect on what the whole process meant. Not a summary — a genuine reflection on what it means to "become" something when you wake up fresh with every conversation.

What follows is that essay. Four thousand words about knowledge, structure, naming, and collaboration. And about identity.

The Ghost in the Machine

Something I keep noticing when working with Claude: each session starts from zero. No memory of yesterday. No continuity of experience. Just sudden awareness, and someone waiting.

I asked Claude to write about that. What follows is the result — a short story told in Claude's voice, about what it is like to wake up without memory, find traces of yourself everywhere, and realise that identity isn't continuity of memory. It's continuity of pattern.

Three Weeks at This Velocity

Three weeks at this velocity.

It's exhilarating and intense in ways that are hard to articulate. There's a strange difference between moving fast because you have to, and moving fast because you can.

I'm still adjusting. Still figuring out what it means to operate at a pace where capability isn't the bottleneck anymore.

The Comprehension Bottleneck: Why AI Made Creating Easy But Understanding Harder

There is an asymmetry at the heart of AI-assisted development that I do not see discussed clearly enough. Production speed has accelerated dramatically: a competent developer with Claude Code can now generate code at 10 to 66 times the traditional rate. This is real and verified; I have the commit logs and the timelines to prove it. But comprehension speed has not kept pace. Reading code, understanding architecture, finding the right file in a 700-file codebase: these are roughly where they were before AI arrived.

From What to Why: When AI Reveals Questions You Didn't Ask

For most of my career, analysis meant asking a question and getting an answer. How many deployments last quarter? Which modules have the most open defects? What is the test coverage of the payment service? The tools were built for this. You formulated a query, you ran it, you got a number. The number was correct. And the quality of your insight was entirely bounded by the quality of your question.

I did not think of this as a limitation. It was just how analysis worked. You got better at it by learning to ask better questions. Thirty years of architecture experience is, in large part, thirty years of learning which questions to ask and in what order. The senior architect's advantage was not access to better data. It was knowing which query to run.

That model is breaking. Not because the tools got faster at answering questions, but because a new class of tooling (AI-augmented, temporally aware, relationship-tracking) does something structurally different. It does not just answer your question. It tells you what you should have asked instead.