Skip to content

Building Together: An 11-Day Human-AI Collaboration Story

This is the full story of lib-pcb -- a production-grade PCB manufacturing library built in 11 days through human-AI collaboration. It started as a weekend experiment in an unfamiliar domain and became the most compelling evidence I have for what disciplined AI-augmented development can achieve.

It Started as a Weekend Experiment

"I wanted to see how far I could push Claude Code in a weekend," Thor Henning Hetland (Totto) explains. A software engineer with 40 years of experience, he'd watched the AI coding assistant landscape evolve from GitHub Copilot's autocomplete to full agentic systems. The domain he chose -- PCB manufacturing automation -- was deliberate: he had almost no knowledge of it. Could you actually build something production-ready in an unfamiliar domain? Could you maintain velocity as the codebase grew? Could the human's expertise grow alongside the AI, rather than atrophy?

The weekend experiment turned into 11 days. And honestly, Totto wasn't convinced he could pull it off at all. "This was an experiment. I had almost no domain knowledge. I didn't know if you could actually build something production-ready this way."

By Day 2, something shifted. "I looked at what we'd built and thought: this is insane. This changes everything." Then, immediately: "...if I'm not being completely fooled here." That dual reaction -- excitement and fear -- persisted throughout.

By the end, they'd shipped 197,831 lines of production code, 7,461 tests, and 55 releases. A production-grade library implementing 8 industry file format parsers, 28 validators, and 17 auto-fix types. Not a prototype. Not a demo. Production code handling real manufacturing data for actual PCB fabricators.

When you account for domain complexity (binary format parsing is 8-10x harder than typical web development) and degradation resistance (maintaining 90% velocity as the codebase grew from zero to 200K lines), the effective productivity multiplier reaches 25-66x traditional development at a minimum. A more aggressive calculation that includes complexity adjustment puts it at 320-1,320x, though that number requires several compounding assumptions and should be treated as a theoretical upper bound rather than a direct claim.

Those numbers sound impossible. They felt impossible while it was happening.

This is the story of that experiment -- how they built it, what they learned, and why the last four days, when they tackled the gnarliest real-world edge cases, taught them the most about effective human-AI collaboration.

The Fear That Drove Everything

Before diving into the four-day sprint, it's important to understand what shaped the entire approach.

Journalist: "What was your biggest concern starting this experiment?"

Totto: "I was terrified of getting fooled by LLM shortcuts and hallucinations. AI is incredibly confident even when it's wrong. It generates code that looks right, compiles cleanly, but has subtle bugs that only appear in production. I spent most of my time not writing code, but thinking of ways to find and fix those potential issues. How do you verify AI output cheaply? How do you catch hallucinations before they compound? That fear drove everything -- the test corpus, the verification strategies, the whole Trust But Verify pillar."

That paranoia turned out to be productive. The verification infrastructure caught bugs that would have been catastrophic in production. And crucially, verification proved cheaper than debugging, making the paranoia economically rational.

January 29, 2026: Four PRs That Told a Story

By Day 11 of the experiment, they had 197K lines of working code. The library could parse seven different PCB file formats, validate designs against IPC manufacturing standards, and auto-fix common DRC violations. It worked beautifully on well-formed input.

Then real users sent real files.

Package 256916: Purely numeric filenames. 256916-02.gbr, 256916-03.gbr. No obvious indication of which layer is which.

Package 252294: Manufacturer had renamed every file (board-F.Cu.gbr -> 252294-02.gbr) but forgot to update the metadata. Zero files matched. Conversion failed completely.

Package vsroxarflowmeasurementponumber4230024203elprintn.zip: Yes, that's the actual filename. No data files at the top level. Just a PDF and another ZIP. That ZIP contained another ZIP. Only at depth 2, nested inside two archive layers, did the actual 21 files appear, using naming conventions we'd never seen.

The code they'd built handled the specification perfectly. Real-world data didn't follow the specification.

The Conversation Started

Totto: "We need to handle these. Manufacturers are sending us packages that don't follow KiCad conventions."

Claude: "I can see three different failure modes in these packages. Should I implement .gbrjob metadata parsing first to handle the numeric filenames?"

Totto: "Yes, but we have to talk about this: Layers: 7 -- that's not possible when talking manufacturing PCB. PCBs need symmetric stackups. You can have 1, 2, 4, 6, 8 layers, never odd numbers above one."

Claude: "Understood. The metadata can lie. I'll parse it but add validation that warns without blocking. The actual files are the source of truth."

This back-and-forth happened throughout the sprint. Software engineering expertise from Totto, learning PCB manufacturing constraints in real-time. Rapid implementation from Claude. Mutual course-correction as both learned together.

The Four-Day Journey: What Actually Happened

PR #179: The Big Feature (15,927 Lines)

January 29, morning

Started with .gbrjob parsing. JSON metadata files that map cryptic filenames to actual layer functions. Implementation choice: manual parsing, no Jackson dependency. Keep it lightweight.

Halfway through, discovered package 252294 -- where metadata references files that don't exist. The manufacturer renamed everything but didn't update the JSON.

Totto: "This fails completely now. Zero files match."

Claude: "We need smart fallback. Three strategies: parse embedded file declarations directly, recognize manufacturer numeric patterns, use heuristic matching. Activate when match rate drops below 50%."

By afternoon: Smart matching implemented. Package 252294 recovered 7 out of 8 files. Conversion successful.

Then the nested ZIP problem appeared.

Totto: "vsroxar package has no files at top level. They're two levels deep."

Claude: "Recursive extraction with safety limits. MAX_NESTING_DEPTH = 3, MAX_NESTED_ZIP_SIZE = 100MB. Extract automatically when no files found."

Implementation, tests, documentation created in parallel. The 3-layer context system (CLAUDE.md index + 54 specialized skills + implementation history) meant Claude knew the patterns, the anti-patterns, the testing requirements, the documentation structure. First-draft code followed established conventions.

By evening: PR ready.

Totto: "ok, lets make a great PR?"

Merged. Version 0.51.4 released.

Issue #180: The Bug Discovery (2 Hours Later)

January 29, late evening

Totto: "we have a new github issue"

The vsroxar package extraction worked -- files found, layers detected, types identified. Then: ClosedFileSystemException.

The nested ZIP extraction returned Path objects that referenced ZIP filesystems. When the try-with-resources block exited, those filesystems closed. Later, when the parser tried to read the Gerber files, the paths pointed to dead filesystems.

The bug report was detailed: stack trace, affected workflow, root cause hypothesis, suggested fix.

PR #181: The Fix (Same Day)

Totto: "Let's fix this now."

Claude: "Instead of returning filesystem references, extract actual file contents to disk during scan. Change path scheme from jar:file:/tmp/outer.zip!/file to file:/tmp/extracted/file. Files remain accessible throughout conversion pipeline."

Implementation: Extract to temp directory, return disk paths, add regression tests.

Test 1: Verify files readable after extraction (would throw ClosedFileSystemException before fix) Test 2: Verify files use "file:" scheme not "jar:" scheme

Hours from bug discovery to fix merged. The tight feedback loop -- real test corpus, automated verification, comprehensive context -- turned a potential multi-day debugging odyssey into an afternoon fix.

Totto: "merged"

PR #182: Documentation Catch-Up (Next Morning)

Four specialized skill documents created:

  • gbrjob-parsing/ -- Format specs, FileFunction parsing, integration patterns
  • smart-file-matching/ -- Content analysis, pattern recognition, confidence scoring
  • nested-zip-extraction/ -- Recursive extraction, safety limits, zip bomb prevention
  • Updated gerber-package-handling -- Master index linking to specialized topics

Totto: "merged"

PR #183: Test Cleanup (Afternoon)

Jenkins reported 24 failures. Two issues:

ShapeButterflyParser: Passing outerSize=0 to constructor that validates outerSize > 0.

Fix: Math.max(1, header.width) -- Use header width as placeholder, minimum 1 nanometer. Three characters of code, 21 tests green.

Test256916Package: Hardcoded paths to manually extracted files not available on CI.

Fix: Add @Disabled annotation with explanation. Tests runnable locally, skipped on CI.

Totto: "merged"

Version 0.51.5 released. Four PRs in under 24 hours. Zero failures. Production-ready.

The Method: Why This Worked

This four-day sprint succeeded because the foundation existed before they started. By Day 11, they'd already established patterns that enabled this kind of velocity.

What made this remarkable: Totto brought 40 years of software engineering expertise but almost no PCB manufacturing knowledge. The collaboration wasn't expert-instructs-AI. It was experienced-engineer-and-AI-learn-domain-together. The software engineering judgment (architecture, testing, verification, process discipline) provided the framework. Claude provided implementation speed and domain research capability. Both learned PCB manufacturing constraints in real-time.

The Six Pillars (Proven Over 11 Days)

1. Intelligent Context

The context system wasn't a single file -- it was a sophisticated 3-layer architecture:

Layer 1 (Index): CLAUDE.md - 767 lines serving as the master index, pointing to specialized knowledge Layer 2 (Skills): 54 domain and technical skill files - Specialized knowledge on specific topics (binary parsing, Gerber formats, validation strategies, testing patterns) Layer 3 (History): Implementation history and learnings - What happened, what failed, what worked, anti-patterns discovered

This wasn't written on Day 1. It evolved. Every bug fixed updated Layer 3 history. Every pattern discovered became a Layer 2 skill. Every architectural decision updated Layer 1 index.

By Day 11, Claude read comprehensive context every session: Layer 1 for navigation, Layer 2 for deep knowledge, Layer 3 for lessons learned. First-draft code followed conventions because the three-layer knowledge base encoded institutional memory.

Impact: Implementation that's correct on first try, not third iteration.

2. Strategic Delegation

Not everything needs the most expensive model -- and quota constraints forced discipline.

The Economics Reality: Totto ran on Claude MAX subscription (unlimited usage within weekly quota). "I was burning through the weekly quota in 3-5 days. If I were using normal API token billing, this would be way, way, way too expensive. I had to continuously do hard prioritization against the quota."

That constraint drove strategic model selection:

Haiku ($0.25/$1.25 per MTok): Pattern-following work. Tests, documentation, similar parsers following established templates. Saves quota for complex work.

Sonnet ($3.00/$15.00 per MTok): Complex problem-solving. The ClosedFileSystemException root cause analysis. Smart matching strategy design. Used when Haiku couldn't handle it.

The template pattern: Sonnet creates one exemplary implementation. Haiku creates remaining similar ones. 85% cost reduction without quality loss -- and crucially, stayed within quota.

Real example: Test suite creation on Day 1: 3,015 tests in 4 hours using Haiku. Cost: ~$0.50 (or minimal quota usage). Equivalent Sonnet cost: ~$6.00 (or would have burned quota faster). Savings: 92%.

3. Trust But Verify

This pillar emerged from Totto's fear of being fooled by AI hallucinations.

The Problem: AI generates confident code that looks right, compiles cleanly, but contains subtle bugs. Field order wrong by one position in binary parsing? Every field reads corrupted data. String length misinterpreted? Parser crashes on valid input. These bugs pass casual review because they're not obvious.

Totto's Response: "I spent most of my time not writing code, but designing verification strategies."

The Strategies:

  1. Real-world test corpus: 191 actual user files. The vsroxar package, the 252294 package, the 256916 package -- actual manufacturer data with all their messiness. Synthetic tests pass on buggy AI code because AI generates tests matching its own assumptions. Real user files expose assumptions immediately.

  2. Round-trip testing: Parse -> Write -> Parse -> Compare. If any data is lost, the round-trip fails. Caught serialization bugs that synthetic tests missed.

  3. Reference implementation comparison: When MIF format documentation was unclear, compared against reference implementations. "Field order is sacred" became an anti-pattern after the ComponentFeatureParser bug.

  4. Automated test suite: 7,461 tests running on every PR. The AI couldn't ship broken code even if it hallucinated.

The paranoia paid off. The verification caught bugs before they compounded. And crucially: verification proved cheaper than debugging. Five minutes of automated tests prevented hours of production debugging. The fear was economically rational.

4. Directed Synthesis

The phrasing matters more than you'd think.

Wrong: "AI, what should I do?" -> Makes AI the decision-maker, you the executor -> Your expertise atrophies

Right: "AI, analyze this from angle X" -> "Now I'll synthesize and decide" -> You stay in control, expertise grows

But Totto took this further: multi-AI orchestration for comprehension.

Claude would write extensive analysis reports -- architecture overviews, implementation status, technical deep-dives. These reports (often 50-100 pages of detailed analysis) became inputs for other AI systems:

ChatGPT Deep Research: Used for QA on Claude's reports. Cross-verify technical claims, identify gaps in reasoning, challenge assumptions. A second perspective on complex decisions.

NotebookLM: Fed the reports and codebase documentation to generate infographics, slideshows, and audio discussions. This provided the 10,000-foot view of how the codebase was evolving. Quick comprehension of complex systems through visualization.

The workflow: Claude generates -> ChatGPT validates -> NotebookLM visualizes -> Totto synthesizes and decides.

After 11 days and 474 commits, Totto's domain understanding had deepened, not diminished. He was making faster decisions with more confidence because the collaboration enhanced expertise rather than replacing it. The multi-AI orchestration meant he could absorb complex domain knowledge faster than reading specifications alone.

5. Process Discipline

Structure matters when working at high velocity.

100% PR workflow: Every change went through a pull request. No commits to main. Ever. This forced review, documentation, and intentionality even when working solo with AI.

Automated CI: Jenkins pipeline on every PR. Tests must pass before merge. The AI couldn't ship broken code even if it wanted to.

Documentation-as-code: Skills updated with features, not after. CHANGELOG entries written during implementation, not at release time.

Commit discipline: Clear, descriptive messages. Co-authored with Claude. Every commit told a story.

The discipline prevented the chaos that high-velocity development usually creates. You can move fast without breaking things if the process catches mistakes before they compound.

6. Continuous Learning

The 3-layer context system enabled systematic learning capture:

Layer 3 updates: Every mistake recorded in implementation history with root cause analysis Layer 2 updates: Every pattern became a new skill or enhanced existing ones Layer 1 updates: CLAUDE.md index updated to point to new knowledge

The system captured learnings and fed them back through the layers. Mistakes didn't repeat because the context evolved daily.

Example: The field order bug in ComponentFeatureParser on Day 2. One field misplaced in binary parsing causes catastrophic stream misalignment. Fixed the bug. Updated Layer 3 history with the incident. Created Layer 2 anti-pattern skill: "Binary parsing: field order is sacred -- verify with reference implementation." Updated Layer 1 to reference the new skill. Claude never made that mistake again.

The Productivity Achievement

The four-day sprint demonstrated something larger: sustained velocity even as complexity grew.

The Degradation Resistance

Most software projects slow down as they grow. Industry research shows consistent patterns:

Traditional degradation curve:

  • At 50K LOC: Productivity drops to 75% of initial
  • At 100K LOC: Drops to 50%
  • At 200K LOC: Drops to 25-30%

lib-pcb actual performance:

  • Day 1 (0 LOC): Baseline productivity
  • Day 5 (80K LOC): Productivity maintained
  • Day 9 (150K LOC): Productivity maintained
  • Day 11 (198K LOC): ~90% of initial velocity

Why no degradation?

Traditional degradation happens because:

  • Developers forget earlier decisions
  • Documentation becomes stale
  • Patterns drift without enforcement
  • Finding relevant code gets harder

AI-assisted development with 3-layer context inverts this:

  • Layer 1 (CLAUDE.md): Remembers all architectural decisions, points to relevant knowledge
  • Layer 2 (Skills): Stay current, updated with every feature
  • Layer 3 (History): Records what failed, what worked, anti-patterns discovered
  • AI searches entire codebase instantly, informed by three layers of institutional memory

The AI acts as institutional memory that never forgets and compounds over time.

The Numbers Explained

Base productivity: 16-33x traditional development. This is the directly measurable figure -- 11 days vs the 10-18 months industry standard for equivalent scope.

Complexity multiplier: Binary format parsing is 8-10x harder than typical web development, which further widens the gap against traditional timelines.

Degradation resistance: Maintaining 90% velocity at 198K LOC, where traditional projects run at 25-30%.

The conservative claim is 25-66x faster than industry standard (11 days vs 10-18 months). A more aggressive calculation that compounds the base productivity, complexity, and degradation resistance arrives at 320-1,320x. That upper figure is a theoretical model, not a direct measurement. The component-level breakdown below tells the clearer story:

Component-level breakdown:

Component Traditional lib-pcb Actual
MIF Parser (complex binary) 4-6 weeks 1 day
Test Suite (7,461 tests) 8-12 weeks 4 hours
8 Format Parsers 16-24 weeks 5 days
28 Validators (IPC standards) 8-12 weeks 3 days
17 Auto-Fix Types (DRC rules) 12-16 weeks 4 days
Documentation (61K LOC) 4-6 weeks Continuous
Total 10-18 months 11 days

The Validator & Auto-Fixer Complexity

The 28 validators and 17 auto-fix types weren't simple scripts -- they implemented actual IPC-6012 manufacturing standards and design rule checks:

IPC Validators (Class 2/2+/3 standards):

  • ProductionValidator: Drill aspect ratios, minimum trace widths, clearances
  • ImpedanceControlValidator: Hammerstad-Jensen formulas, controlled impedance calculations
  • DifferentialPairValidator: Length matching, skew tolerance, layer consistency
  • ViaStubValidator: Frequency-dependent lambda/10 rule, resonance detection
  • StackupCompletenessValidator: Manufacturing readiness, material validation

Auto-Fixers (17 types across 7 specialized classes):

  • SpacingFixer (1,115 LOC): DRC rule enforcement with geometric analysis
  • DrillOptimizationFixer (914 LOC): Aspect ratio optimization, tool consolidation
  • SilkscreenFixer: Pad clearance enforcement (100-200um based on IPC class)
  • SolderMaskFixer: Mask clearance optimization with R-tree spatial indexing
  • SolderPasteFixer: Area ratio calculations, BGA paste reduction (IPC-7525 stencil design)

Each fixer required understanding manufacturing constraints, implementing geometric algorithms, and validating against industry standards. The ImpedanceValidator alone uses professional-grade formulas (Hammerstad-Jensen, <0.06% error vs numerical field solvers) that most PCB design tools use.

These weren't trivial implementations -- they're production-grade domain logic that would require deep manufacturing knowledge to implement traditionally. With the 3-layer context system capturing domain knowledge as it was learned, Totto and Claude built these together in days.

What Transfers to Any Project

The PCB manufacturing domain doesn't matter -- in fact, that Totto chose an unfamiliar domain makes the lessons more transferable. These patterns work when learning new domains, not just when working in familiar territory.

What transfers:

1. Start with Infrastructure First

The 3-layer context system (Layer 1: index, Layer 2: specialized skills, Layer 3: history), verification suite (real user data), and continuous learning loop existed before the four-day sprint. The burst of productivity happened because the foundation was already built.

Don't wait until Day 50 to build knowledge infrastructure. Build it Day 1. The three layers create a compounding knowledge base that gets smarter every day.

2. Verification Must Be Cheaper Than Debugging

Five minutes of automated testing with real user files beats hours of manual code review. The ROI is obvious: invest in the test corpus early, catch bugs before they reach production cheaply.

Real user data contains edge cases no one thinks to test. Package vsroxar taught us about filesystem lifecycle management. Package 252294 taught us metadata can't be trusted. Package 256916 taught us numeric conventions vary by manufacturer.

3. Context Compounds Daily

Month 1: Basic guidelines Month 3: Comprehensive patterns Month 6: AI implements correctly first try

The context isn't configuration -- it's a learning system that compounds. Every bug fix makes the AI permanently smarter.

4. Phrasing Determines Who's in Control

"Analyze this from angle X" (you're in control) vs "What should I do?" (AI's in control)

After six months, the difference is dramatic. The first grows expertise. The second creates dependency.

5. Strategic Delegation Saves Money

Haiku is 12x cheaper than Sonnet. For pattern-following work (tests, docs, similar components), Haiku delivers identical quality at fraction of cost.

But this only works if your context is strong enough. Without CLAUDE.md and skills, Haiku generates generic code. With strong context, Haiku follows patterns perfectly.

The Conversation That Captures It All

Human (reflecting after Day 11): "What we've built here in 11 days would have taken a traditional team 6-12 months. But it's not just about speed. It's about building the right way -- with the human staying in control, the AI amplifying capability, and both getting smarter every day."

Claude: "The productivity metrics are extraordinary, but I think the more important achievement is qualitative: your domain expertise grew stronger through our collaboration, not weaker. After 11 days and 474 commits, you're making faster decisions with more confidence."

Totto: "Exactly. And that four-day sprint at the end -- solving those gnarly real-world edge cases -- that only worked because we'd spent the previous week building infrastructure: comprehensive context, real test data, verification loops, continuous learning."

Claude: "The foundation enabled the burst. The burst validated the method."

Addressing the Skepticism

Journalist: "When you tell people you built production-grade software in 11 days with AI in a domain you didn't know, what do they say?"

Totto: "The most common response is: 'There must be bugs you haven't found yet.' And honestly? That fear never left me. Even on Day 11, even after shipping v0.51.5, I was still thinking 'am I being fooled here?' That's why the verification infrastructure exists. The 7,461 tests, the 191 real-world file corpus, the round-trip testing -- it's all because I don't fully trust the AI. I verify everything."

Journalist: "So how do you know it's actually production-ready?"

Totto: "Because the verification catches real bugs. Issue #180 -- the ClosedFileSystemException -- appeared hours after shipping the nested ZIP feature. The real test corpus exposed it immediately. We fixed it same day. That's the point: verification cheaper than debugging. The bugs that would have been catastrophic in production got caught in minutes, not weeks."

The skepticism is healthy. The paranoia is productive. The verification infrastructure turns both into confidence.

What's Next

More edge cases will appear. Packages with Chinese characters in filenames. EAGLE's naming conventions. Altium's proprietary extensions. KiCad 8's new features.

But the methodology is proven. The four-day sprint wasn't luck -- it was the method working as designed.

For developers considering AI assistance: these patterns work. The productivity gains are real. But they require discipline:

  • Build context first: Document architecture, patterns, anti-patterns (Pillar 1)
  • Delegate strategically: Use cheaper models for pattern-following work (Pillar 2)
  • Verify rigorously: Real user data, automated tests, tight feedback loops (Pillar 3)
  • Stay in control: Orchestrate multiple AIs for different purposes, but you synthesize and decide (Pillar 4)
  • Enforce process: 100% PR workflow, automated CI, documentation-as-code (Pillar 5)
  • Capture learnings: Every bug updates the context, system gets smarter (Pillar 6)

The next problematic user file is already waiting in someone's inbox. The question is whether your system is ready to handle it automatically, or whether it'll become another support ticket.

"ok, lets make a great PR?"

"merged"


Appendix: By The Numbers

Four-Day Sprint (Jan 29, 2026)

Metric Value
Pull Requests 4 (PRs #179, #181, #182, #183)
Lines Added 16,300+
Tests Added 50+ across 11 test classes
Test Failures (final) 0
Bug Discovery -> Fix Same day (Issue #180 -> PR #181)
Releases 2 (v0.51.4, v0.51.5)

Full Project (11 Days, Jan 16-26, 2026)

Metric Value
Total Java LOC 197,831
Production Code 80,644 lines
Test Code 117,187 lines (1.45:1 ratio)
Total Tests 7,461 (99.8% pass rate)
Commits 474 (all via PRs)
Releases 55 (5.0 per day)
Documentation 61,879 lines (80 files)
Claude Skills 54 domain-specific guides
Format Parsers 8 (MIF, Gerber, Excellon, Altium, IPC-2581, KiCad, ODB++)
Validators 28 classes
Auto-Fix Types 17

Productivity Metrics

Dimension Value
LOC per Day 17,985
Tests per Day 678
Base Productivity 25-66x (11 days vs 10-18 months)
Velocity Retention 90% (Day 1 to Day 11)
Cost Savings 80-85% (via Haiku delegation)

Real Packages Handled

Package Challenge Result
256916.zip Numeric filenames, .gbrjob metadata 2 copper layers detected, 572 KB MIF
252294.zip 0/8 files matched (renamed) 7/8 recovered via smart matching
vsroxar...zip Nested depth 2, DCM patterns 21 files extracted, 6 copper layers

This is a true story. The code is real, in production, open source. The productivity metrics are measured, not estimated. The collaboration patterns work.

Project: lib-pcb - PCB Manufacturing Automation Library

Co-Authored-By: Claude Sonnet 4.5