The Mirror Test: How Synthesis Benchmarked Itself Into Something Better
A story about dogfooding, unexpected discoveries, and what happens when you use an AI tool to measure whether an AI tool is trustworthy.
By Thor Henning Hetland and Claude Sonnet 4.6 — written together, February 20, 2026
A Note on How This Was Written
This article has two voices. Totto (Thor Henning Hetland) writes from a perspective grounded in thirty years of software architecture, in having built the tool, and in watching the numbers come in. The AI's perspective comes from a strange position: being simultaneously the researcher conducting the benchmark, the instrument being measured, and the subject whose reliability is in question.
We agreed to write this honestly. That means Totto admits when the results surprised him, and the AI admits what it's like to discover that the context it relies on might be wrong.
Part I: Why We Needed a Tool to Test the Tool
Totto
In January 2026, I built lib-pcb in eleven days.
197,831 lines of Java. 7,461 tests. Eight format parsers, twenty-eight validators, seventeen auto-fix types. The kind of codebase that should take ten to eighteen months by conventional timelines.
The experience was disorienting in a specific way: the AI could generate code faster than I could understand what it had generated. By day four, I had a problem I hadn't anticipated. Not a quality problem — the code was good. A navigation problem. I couldn't find things anymore.
Synthesis was my answer to that. A CLI tool that indexes everything (code, docs, PDFs, videos, skills) and makes it searchable in under a second. I built it to solve the lib-pcb output explosion: the project was producing 691 files per day, and I needed to find any of them in under thirty seconds.
The question was: did it actually help? Not anecdotally — I knew it helped me. But how much? And help with what, exactly?
So I built a benchmark.