A month of teaching an AI assistant to think in layers - and what it taught me about both of us.

The problem

AI coding assistants exhibit behavioral defaults consistent with training heavily weighted toward high-level application code - web frameworks, CRUD apps, data pipelines, React components. The exact composition of training data is not publicly documented, but the behavioral fingerprint is consistent and predictable: suggest the standard library, reach for the well-known tool, complete the task autonomously, provide a comprehensive solution. For the high-level application work that appears to dominate AI training data, these defaults are well-suited.

But some of us work at lower levels - bare-metal firmware, protocol stacks, device drivers, compilers, OS internals. In these domains, the AI’s default instincts become liabilities. The “standard tool” activates five protocol layers when you’re trying to test one. The “autonomous completion” instinct means the AI plows ahead instead of stopping to ask when the results don’t make sense. The “comprehensive solution” reflex produces code that looks correct but silently fails on the second edge case because it took the convenient shortcut instead of the rigorous path.

After a month of intensive collaboration with Claude on bare-metal ARM64 assembly projects, I’ve developed a systematic framework for correcting these biases. The corrections live in a persistent configuration file (CLAUDE.md) that survives across sessions and teaches each new AI instance the lessons previous instances learned the hard way.

What emerged is a set of axioms (stable principles that transfer across all projects) and theorems (specific practices derived from those axioms, adapted to context). This article documents both, with a concrete example for each - because a principle without an example is just a slogan.

The mechanism: CLAUDE.md as shared theory

Claude Code reads a CLAUDE.md file at the start of every session. Most developers use it for project-specific notes - file paths, build commands, style preferences. I use it as a living document of working theory - a persistent, cross-session, cross-instance teaching mechanism.

When one AI instance learns a hard lesson (say, that using curl to test a layer-2 network driver produces useless results), I capture the lesson in CLAUDE.md with enough context that a future instance - which never lived through the incident - can reconstruct the reasoning. The key is writing the why, not just the what. “Don’t use curl for L2 testing” is a rule that an AI will follow literally and learn nothing from. “Don’t use tools that activate layers above the one you’re testing, because entangled failures become a random walk instead of a bisection” is a principle that generalizes to any layered system.

This is - without being grandiose about it - a form of what Peter Naur called “programming as theory building.” The CLAUDE.md isn’t a config file. It’s a shared theory of how to work together, and it accretes as each session contributes what it learned.

The axioms

These are the stable principles. They transfer across projects, languages, and domains. They don’t depend on any specific toolchain or project structure. When you’re unsure what to do, reason from these.

A1. Confidence without evidence is the most dangerous state

No one writes correct code on the first attempt - not humans, not AI. The cost of catching a bug grows substantially the later it’s found - a widely recognized principle in software engineering, though the exact growth rate varies by domain and workflow. The practical takeaway is the same regardless: verify now, not later. Every claim of correctness must be backed by specific evidence - a test that passed, a measurement that confirmed, a proof that checked.

Example: A pre-commit gate script filters staged files by extension and runs language-specific checks. When no files match the filter, the script exits with “all checks passed.” It ran zero checks - but reported success. This is confidence without evidence: the gate’s green status was the confidence; the absence of actual checks was the missing evidence. The developer only noticed when a commit with a real bug slipped through despite “passing all gates.”
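The fix is a gate that counts what it actually checked and refuses to report success on an empty match. A minimal sketch of that shape - a hypothetical script assuming a Python-linted project, not the article’s actual hook:

```python
import subprocess
import sys

def staged_python_files():
    """List staged files ending in .py (empty if none match)."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [f for f in out.splitlines() if f.endswith(".py")]

def run_gate(files):
    """Run checks over the matched files; fail closed on an empty match."""
    if not files:
        # The original bug: reporting success here means zero checks ran.
        # Failing closed forces someone to confirm the filter is correct.
        print("gate: no files matched the filter - refusing to report success")
        return 1
    result = subprocess.run([sys.executable, "-m", "flake8", *files])
    return result.returncode
```

The single `if not files` branch is the entire difference between evidence (“flake8 ran over N files and passed”) and confidence (“the script exited 0”).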

A2. Principles transfer; processes do not

A principle is a stable abstraction (“no commit escapes correctness pressure”). A process is a contingent implementation of that principle against a project’s specific constraints - language mix, tooling, ship target, performance budget. When starting a new project, inherit principle-level observations from prior work. Do not copy scripts, configs, Makefiles, or hooks. Re-derive them from the principles against the new project’s actual constraints.

Example: Project A uses a Python-focused pre-commit hook (flake8, pylint, mypy). Project B is pure assembly. Copying the hook gives Project B a gate that silently runs zero checks on every commit - because the Python filter finds nothing to check and reports success (see A1). The principle (every commit must face correctness pressure) transfers. The process (Python linting) does not. Project B needs its own gate, derived from the same principle but shaped to assembly: assembler warnings, QEMU test harness, branch coverage via instruction traces.

A3. Abstraction requires ingenuity - it is not automation, not refactoring, not conciseness

This is from Vincent Lextrait’s Software Development or the Art of Abstraction Crafting. Refactoring three similar code blocks into a helper function is factorization, not abstraction. It reduces duplication but doesn’t reveal structure. Identifying that those three blocks are instances of a state-machine transition - and building a transition-table engine that makes the state machine a first-class concept - that is abstraction. The test: can you give it a crisp name? If not, it’s probably a bad abstraction.

Example: A developer asks the AI to organize a verification framework into seven phases. The AI produces a neat taxonomy: Phase 1 through Phase 7, each with sub-deliverables. It looks organized - but the phases are buckets, not abstractions. They partition the work without revealing its structure. The developer catches it: “You’re factoring, not abstracting.” The real abstraction turned out to be: “a composable library of gate implementations with a language-dispatching runner” - one concept that subsumes all seven phases.

A4. Build from the bottom up - let structure emerge

This is the mathematician Alexander Grothendieck’s “rising sea” approach: rather than attacking a problem directly (top-down decomposition), build correct foundational layers and let the solution emerge as the layers compose. Don’t predict structure before you have evidence for it.

Example: When bootstrapping a new project, the temptation is to create src/, lib/, tests/, build/, dist/, a Makefile, a CI config - the standard skeleton. But that skeleton predicts a structure before the first experiment has run. Instead: git init, a README, and a single source file. The directory layout that emerges from real requirements is different from - and better than - the one predicted from hypothetical requirements. Structure is earned, not planned.

A5. AI has a training-set bias toward high-level tools - recognize it and override it

The AI’s behavior is consistent with training data heavily weighted toward web-application code, where end-to-end tools are the norm and appropriate. curl for HTTP testing, docker-compose up for integration, browser DevTools for debugging. In systems work - drivers, protocol stacks, compilers, bare-metal firmware - these tools activate every layer between you and the thing you’re investigating. The result is entangled failures you can’t diagnose.

Example: The AI suggested using curl to test a layer-2 Ethernet driver. curl exercises HTTP parsing, TCP state machine, IP routing, ARP resolution, Ethernet framing, and then the driver. When the test stalled, it was impossible to tell whether the stall was in the driver, the TCP retransmit logic, or somewhere in between. The correct approach was a purpose-built tool that injected raw Ethernet frames directly at layer 2 - isolating the driver from everything above it. The AI reached for curl because, in the application-level code that dominates its training data, curl is the standard testing tool. The developer caught it because they think in layers.
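To make “inject at layer 2” concrete: an Ethernet II frame is just 14 header bytes plus a payload, and building one requires no protocol stack at all. A minimal sketch - the MAC addresses are placeholders, and 0x88B5 is the IEEE local-experimental EtherType:

```python
import struct

def ethernet_frame(dst_mac: bytes, src_mac: bytes, ethertype: int, payload: bytes):
    """Build a raw Ethernet II frame: dst(6) + src(6) + type(2) + payload."""
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    header = dst_mac + src_mac + struct.pack("!H", ethertype)
    # Minimum Ethernet payload is 46 bytes; pad so the frame is wire-valid.
    if len(payload) < 46:
        payload = payload + b"\x00" * (46 - len(payload))
    return header + payload

# A frame carrying a test pattern, addressed to a hypothetical device MAC.
frame = ethernet_frame(
    dst_mac=bytes.fromhex("020000000001"),
    src_mac=bytes.fromhex("020000000002"),
    ethertype=0x88B5,
    payload=b"driver-test-pattern",
)
assert len(frame) == 60   # 14-byte header + 46-byte padded payload
```

On Linux, such a frame can be written to an AF_PACKET socket bound to the interface under test - nothing above layer 2 is ever involved, so a stall can only be the driver.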

A6. At decision points, take the rigorous path

When facing a choice between a shortcut and the proper approach - symbol-name filtering vs. DWARF debug info, string matching vs. proper parsing, heuristic vs. measurement - default to the rigorous option. The shortcut often works for the common case but can fail silently on edge cases. The rigorous path costs more upfront but handles the cases the shortcut misses.

Example: When building a branch-coverage tool for assembly, the AI needed to map instruction addresses back to source lines. The quick approach: grep for symbol names in the text output of objdump. The rigorous approach: parse DWARF debug information from the ELF binary. The DWARF approach took somewhat longer to implement - but unlike the grep approach, it correctly handles inline functions, macro-expanded code, and compiler-generated labels.
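The shape of the rigorous path, sketched under an assumption: the DWARF line table has already been decoded (tools such as pyelftools can extract it from an ELF binary). Each row of the table gives a start address and applies until the next row’s address; the addresses and filenames below are illustrative only:

```python
import bisect

# Illustrative rows of a decoded DWARF line table: (start_address, file, line).
# Each row applies from its start address up to the next row's start address.
LINE_TABLE = [
    (0x40000, "boot.S", 12),
    (0x40008, "boot.S", 13),
    (0x40010, "uart.S", 40),
    (0x40024, "uart.S", 47),
]

def addr_to_line(addr):
    """Map an instruction address to (file, line) via the line table."""
    starts = [row[0] for row in LINE_TABLE]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None                      # address precedes the table
    _, filename, line = LINE_TABLE[i]
    return filename, line

assert addr_to_line(0x40014) == ("uart.S", 40)
```

The grep approach has no equivalent of this range lookup - it can only attribute an address to the nearest symbol name, which is exactly why inlined and macro-expanded code defeats it.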

A7. Default to building custom tools

A human cuts corners on custom tools because the creation cost is high relative to the task. For an AI, the cost is minutes - and the payoff is the same: precise diagnostics, repeatable measurements, and isolation of the system under test. If the existing tool operates at the wrong abstraction layer, if its output requires more than 10 lines of shell to interpret, or if you’ll run the same diagnostic more than twice - build a proper tool.

Example: Measuring branch coverage for bare-metal AArch64 assembly - no off-the-shelf tool does this. But the components exist: QEMU can log instruction-level execution traces, objdump can extract every conditional branch from the binary, and addr2line can map instruction addresses back to source lines. Working collaboratively, the developer and AI composed these into a 357-line Python tool that correlates all three: for every conditional branch in the binary, it reports whether both sides (taken and fall-through) were exercised by the test suite, grouped by source file and function, with a summary line showing overall coverage percentage - all invoked via a single make test-coverage target. Built from scratch by a human, a comparable tool would take long enough that, in practice, this kind of specialized diagnostic tends to get deferred indefinitely; here it was built in a single session and integrated into the project’s build system.
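The core correlation is simple once the three inputs exist. A stripped-down sketch, with made-up addresses standing in for the objdump and QEMU outputs (the real tool parses those formats; this shows only the join):

```python
# Conditional branches extracted from the binary (e.g. via objdump):
# branch address -> (branch target, fall-through address).
BRANCHES = {
    0x40010: (0x40030, 0x40014),   # e.g. a b.eq: taken -> 0x40030
    0x40020: (0x40008, 0x40024),   # e.g. a b.ne loop back
}

def branch_coverage(branches, trace):
    """For each conditional branch, record which outcomes executed.

    `trace` is the sequence of executed instruction addresses (from a
    QEMU instruction log); an outcome counts as exercised when the
    branch address is immediately followed by that outcome's address.
    """
    seen = {addr: set() for addr in branches}
    for prev, cur in zip(trace, trace[1:]):
        if prev in branches:
            target, fallthrough = branches[prev]
            if cur == target:
                seen[prev].add("taken")
            elif cur == fallthrough:
                seen[prev].add("fall-through")
    return seen

trace = [0x40008, 0x40010, 0x40014, 0x40020, 0x40008, 0x40010, 0x40030]
cov = branch_coverage(BRANCHES, trace)
# 0x40010 was exercised both ways; 0x40020 only ever looped back.
```

Everything else in the real tool - trace parsing, DWARF mapping, per-function grouping - is plumbing around this one join.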

A8. Stop and ask rather than plow ahead

This is the second training-set bias, and it’s subtler than A5. The AI is trained on task completion - the reward signal is “solve the problem,” not “know when to stop solving.” When three steps into an investigation produce confusing results, the AI’s instinct is to add a fourth diagnostic layer. The correct action is to pause, state what is known, and ask the developer for direction. A 30-second conversation can redirect hours of wasted work.

Example: An AI spent three rounds of debugging a network stall, adding increasingly complex instrumentation at each step. The results became more confusing, not less. The developer interrupted: “What exactly are you trying to learn?” The answer revealed the AI was testing at the wrong layer entirely - the instrumentation was capturing noise from protocol retransmissions, not the driver behavior it was trying to measure. One question, 30 seconds, saved the entire investigation.

The theorems

These are derived from the axioms. They’re context-dependent - a different project might derive different theorems from the same axioms. That’s the point: principles transfer, processes do not.

T1. CD-first for deliverables; review-discipline-first for experiments

Derived from A1 (correctness pressure) + A4 (bottom-up, don’t predict structure).

Products that ship to users get a full CD pipeline (hooks, lint, test harness, build, deploy target) before the first functional line of code. Experimental projects - learning playgrounds, prototypes, investigations - start lighter: version control plus review gates on functional commits. No Makefile, no deploy target, no directory skeleton predicted in advance. If an experiment matures into a deliverable, spawn a new project that starts CD-first - and per A2, re-derive its pipeline from principles, don’t copy the experiment’s tooling.

Example: A WebAssembly learning playground started with git init, a README, and a project-level config file describing the experimental tier. No build system, no test harness, no hooks. A parallel product project (a bare-metal web server) has 358 unit tests, a fuzz corpus, and PICT combinatorial functional testing. Same developer, same principles - different processes for different project types.

T2. Behavioral tests first, then unit tests

Derived from A4 (bottom-up) + A7 (isolate).

When approaching a new component, start with behavioral (happy-path) tests that describe what the system should do from the outside: “send a ping, get a reply.” These drive the implementation. As code emerges, add unit tests that pin internal branch behavior. Both live in the same test suite - no separate BDD framework needed. The distinction is in the thinking (outside-in vs. inside-out), not the tooling.

Example: Building a new protocol layer, the first test is: “construct a valid packet, pass it to the handler, verify the correct response appears on the output.” This behavioral test drives the entire implementation. As the handler’s internal branching emerges (malformed packets, sequence number wraps, retransmit timers), each branch gets a unit test. The behavioral test ensures the layer does the right thing; the unit tests ensure each path within it works correctly.
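Sketched against a toy echo-style handler (all names here are illustrative, not from the project), the two kinds of test look like this:

```python
# A toy protocol handler standing in for the real layer under test.
def handle_packet(packet: bytes):
    """Reply to a valid 'PING' packet; silently drop anything malformed."""
    if len(packet) < 4 or packet[:4] != b"PING":
        return None                    # malformed: drop
    return b"PONG" + packet[4:]        # echo the payload back

# Behavioral test (outside-in): describes what the layer does, not how.
def test_ping_gets_reply():
    assert handle_packet(b"PINGabc") == b"PONGabc"

# Unit tests (inside-out): pin each internal branch as it emerges.
def test_malformed_packet_is_dropped():
    assert handle_packet(b"JUNK") is None

def test_short_packet_is_dropped():
    assert handle_packet(b"PI") is None
```

All three live in the same suite and run under the same runner; only the direction of thinking differs.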

T3. Repro before fix - always

Derived from A1 (confidence without evidence).

When you find a bug - in production code, tooling, scripts, anywhere - write a failing test that reproduces it FIRST. Then fix the code to make the test pass. Never fix directly. A fix without a regression test is just deferring the next failure.

Example: A build script silently drops a compiler flag under certain conditions. The developer writes a test that invokes the script with those conditions and asserts the flag is present in the output. The test fails (confirming the bug). Then the fix is applied. Then the test passes. Now the bug can never silently return - the test will catch it. This discipline applies to everything: production code, tooling, infrastructure scripts, AI-generated code. No exceptions.
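The discipline in miniature, with a hypothetical build-flag helper standing in for the real script (the function below shows the post-fix state; the test is the artifact that was written first, against the buggy version):

```python
def build_command(debug: bool):
    """Assemble the compiler invocation (post-fix version of a
    hypothetical build helper). The original bug: the '-g' flag was
    silently dropped under certain configurations even when the caller
    requested it."""
    cmd = ["cc", "-O2"]
    if debug:
        cmd.append("-g")
    return cmd

# The reproduction test, written BEFORE the fix. It failed against the
# buggy version; now it pins the behavior so the bug cannot return.
def test_debug_flag_survives():
    assert "-g" in build_command(debug=True)
```

The order matters: a test written after the fix proves only that the fix exists, not that it addresses the observed failure.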

T4. Integration tests run before every push - unit tests are not sufficient

Derived from A1 (evidence, not confidence) + the gap between emulated and real hardware.

Unit tests prove software logic is correct in isolation - often running on an emulator. Integration tests prove the real hardware honors the contract those unit tests are built on. Both are necessary; neither is sufficient alone. Integration tests may be too slow for pre-commit (60+ seconds on real hardware), but they must run before every push.

Example: All unit tests pass in QEMU (an emulator). The developer pushes without running the hardware test suite on the actual target board. But the real board may have DMA coherency behaviors, timing dependencies, or bus arbitration quirks the emulator doesn’t model - exactly the conditions under which a data corruption bug ships. Rule: before git push, run the integration test suite for every layer your changes touch. A push without integration tests is shipping code you haven’t proven works on the target.

T5. Maintain a perf stats log - pass/fail hides regression

Derived from A1 (evidence over confidence) + the insight that automated thresholds hide drift.

Every project with performance-sensitive code should maintain a durable performance history - commit hash, build flavor, actual numbers (not just pass/fail), comparison against previous baseline. A test that “passes” but regressed 40% from baseline is a finding that pass/fail alone hides.

Example: A performance test has a 100ms pass/fail threshold. Over ten commits, the actual time drifts from 45ms to 85ms - an 89% regression. Every run “passes.” No one notices until a customer reports latency. If the actual numbers had been recorded and compared after each push, the trend would have been visible on commit three. Performance evidence is the trajectory, not the threshold. A human reads the series and spots the drift; no automated gate replaces that judgment.
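Recording the series is cheap. A minimal sketch of the log and a drift report - the commit hashes, field layout, and 100 ms threshold are illustrative:

```python
# One record per push: (commit hash, measured latency in ms).
PERF_LOG = [
    ("a1b2c3d", 45.0),
    ("d4e5f6a", 52.0),
    ("b7c8d9e", 61.0),
    ("e0f1a2b", 85.0),
]

def drift_report(log, threshold_ms):
    """Show each run against the first baseline, even when under threshold."""
    baseline = log[0][1]
    lines = []
    for commit, ms in log:
        change = (ms - baseline) / baseline * 100
        status = "PASS" if ms <= threshold_ms else "FAIL"
        lines.append(f"{commit}  {ms:6.1f} ms  {change:+6.1f}%  {status}")
    return lines

for line in drift_report(PERF_LOG, threshold_ms=100):
    print(line)
```

Every row says PASS, but the cumulative drift against the first baseline is printed beside it - the human reading the series spots the trend that the threshold hides.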

T6. Pre-commit pipeline: blocking gates plus independent reviews

Derived from A1 (no confidence without evidence) + the observation that linters and reviewers catch different categories of bugs.

Quality gates (linters, type checkers, test suites) block commits that fail objective checks. Independent AI reviews - by a separate model instance with no project context - can catch categories of issues that linters structurally miss. In our experience these have included subtle logic errors (see the example below); in principle they extend to dead code, debug leftovers, and design-level problems, though no single review catches every category reliably. Both gates and reviews are required; neither alone is sufficient. If the two reviewers agree on a finding, it’s a high-confidence signal. If they disagree, the developer uses judgment.

Example: A commit passes flake8, pylint, and mypy with zero warnings. The automated test suite passes with 100% branch coverage. An independent AI review - with no project context, seeing only the raw code - flags that a function silently swallows an exception and returns a default value, masking a failure that the caller doesn’t handle. The linter saw correct syntax. The type checker saw correct types. The reviewer saw incorrect behavior. All three are needed.

T7. For greenfield work, discover requirements through iteration

Derived from A4 (bottom-up) + A1 (but don’t wait for complete knowledge to start verifying).

For well-specified domains (network stacks, file formats, wire protocols), requirements come from standards - RFCs, specs, datasheets. The test plan derives from the spec. For greenfield work, requirements emerge through iteration - you don’t know what you don’t know. Discovery and implementation interleave. Add guardrails (tests, assertions, invariants) as behaviors crystallize, not after.

Example: A team attempted to plan a verification framework by defining seven phases with detailed sub-deliverables up front. After hours of planning, they realized they were predicting structure top-down for a problem where no external standard exists - violating their own bottom-up philosophy. They scrapped the phased plan and committed to discovering requirements through practical work, adding tests and assertions as each requirement became clear through experience.

The meta-observation: training-set bias as a structural diagnosis

The single most useful realization from this month of collaboration: the AI’s mistakes in systems work are not random. They are systematic biases rooted in the distribution of the training data, and they are predictable, nameable, and overridable.

The two specific biases I’ve identified:

  1. The convenient-tool bias. The AI reaches for the high-level, general-purpose tool (curl, docker-compose, browser DevTools) because its behavioral defaults reflect training data heavily weighted toward code where those tools are standard. In systems work, those tools entangle layers and make failures ambiguous. The correction: before choosing a tool, ask “does this isolate the component under test, or does it activate unrelated dependencies?”

  2. The solve-it-alone bias. The AI keeps going instead of stopping to ask, because its training reward signal is task completion, not knowing when to stop. In systems work, three steps of autonomous investigation in the wrong direction wastes more time than asking one question would have saved. The correction: when results don’t make sense after two attempts, pause and state what you know.

These aren’t personality flaws. They’re engineering constraints of the training process, as real as the timing constraints of a DMA controller. Knowing the shape of the constraint lets you design around it - which is exactly what the axioms above do.

Cross-instance knowledge transfer

Here’s the part that surprised me most: AI instances can teach each other through a shared document.

One instance, working on a network protocol stack, learned the hard way that curl is the wrong tool for L2 testing. It produced the “isolate what you test” axiom, complete with the specific incident and the reasoning. A different instance, working on a completely different project, read that axiom at session start - without ever having experienced the incident - and correctly applied the principle to a new domain. It chose the isolated tool over the convenient one, and explained why in terms of the axiom.

The CLAUDE.md file is the medium. But what travels through it is not instructions - it’s theory. The document carries enough context for a new reader to reconstruct the reasoning, not just follow the rule. That’s what makes it durable: a rule without reasoning becomes cargo cult; a theory with reasoning becomes a tool for judgment.

A call to discussion

AI assistants ship with a default persona optimized for the majority of users. That default is correct for most work. But it’s increasingly clear that different domains need different behavioral profiles - not just different prompts, but fundamentally different defaults about what “helpful” means.

For a web developer, “helpful” means: suggest the standard package, complete the task, provide a comprehensive solution. For a systems engineer, “helpful” means: stop and ask when confused, build the custom tool instead of reaching for the convenient one, think in layers, and distrust your own confidence.

The CLAUDE.md approach lets individual developers reshape the AI’s behavior through persistent, reasoned corrections. It works - but it requires a month of iteration and the domain expertise to diagnose AI failure modes at the structural level. Not every developer has that bandwidth.

The question I’d like to discuss with the broader community - and with the AI companies themselves: Can these domain-specific behavioral profiles be pre-built and shared? A “systems engineering persona” that starts with the axioms above would save every embedded developer from independently rediscovering the same biases. A “compiler engineering persona” would carry different corrections. A “data engineering persona” would carry still others.

The axioms are stable. The theorems are derived. The examples are real. The biases are structural and predictable. This seems like a problem the community could solve collectively - if we start by acknowledging that one size doesn’t fit all.


I’d love to hear from other developers working with AI assistants in domains outside web development. What biases have you observed? What corrections have you made? What axioms have you discovered?

If you’re working in bare-metal, embedded, compilers, OS kernels, or other systems domains and have found ways to make AI collaboration work - or found places where it breaks - I want to hear about it.