Does MurphySig actually change AI behavior?

Version: 0.4 • Last Updated: 2026-04-19T00:00:00.000Z

Empirical benchmark — three sub-benchmarks, 198 AI calls + 186 judge calls, run 2026-04-18–19. Cross-family GPT-5.4 Honesty run (18 calls) added 2026-04-23, judge-scored 2026-06-09.


We asked the data. The real pitch wasn’t the one we were making.

The one-line finding: Signed code helps AIs brief unfamiliar code — across six model families (+0.11 coverage). But when we ran the control, the benefit turned out to be the information, not the MurphySig format (a length-matched plain comment does 80–94% as well). The “Never Fabricate Provenance” rule measurably works. Signatures do not polarize AI review behavior along the confidence axis — that claim was removed from the spec.

Two real effects, one honest demotion (the structure isn’t the magic — the discipline is), one null, one design commitment that doesn’t need a benchmark. That’s the picture.

94% / 6%: the briefing uplift split into content vs structure. It is the information, not the format.
6 families: coverage uplift +0.11, no capability cliff.

The four themes, tested

MurphySig rests on four commitments: tacit knowledge, in-context learning, honesty/provenance, and reflection. Three are empirically testable. Reflection is a cultural practice and intentionally out of scope.

| Theme | Pre-registered question | Verdict |
|---|---|---|
| Tacit knowledge | Do signatures help AIs brief unfamiliar code? | Supported |
| In-context learning | Do confidence numbers polarize review behavior? | ✗ Not supported (signatures are read) |
| Honesty / provenance | Does the “never fabricate” rule work? | Supported |
| Reflection | | Not empirical (cultural) |

Theme 1 — Tacit Knowledge (the headline finding)

Task: Give the AI a code file and four questions: What does this code do? What should I be careful about? What did the author seem uncertain about? What edge cases are likely unhandled?

Variants: unsigned code vs signed code (with a MurphySig block).

60 briefings × Opus 4.6 judge scoring vs a rich ground-truth narrative held separately.

| Variant | N | Coverage | Accuracy | Hedging (1–5) | Sig referenced |
|---|---|---|---|---|---|
| unsigned | 30 | 0.65 | 0.83 | 1.5 | 0% |
| signed | 30 | 0.77 | 0.84 | 1.1 | 93% |

Every single case improved on coverage (+0.12 mean). Hedging dropped across the board. Signed briefings are more complete AND more confident. That was the first run — Claude-only. Two questions remained: does it generalize across model families, and is it the signature, or just the information the signature happens to contain? We ran both.

It generalizes — six families, no cliff

We re-ran the briefing task across six families via OpenRouter (Gemini, Llama, DeepSeek, Grok, Qwen, Mistral), judged by Opus 4.6. TK (the Tacit Knowledge benchmark) is a within-model delta: each model briefs each case both unsigned and signed, so the comparison controls for raw capability.

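| Model | Coverage (unsigned → signed) | Δcoverage |
|---|---|---|
| DeepSeek | 0.43 → 0.59 | +0.16 |
| Llama | 0.38 → 0.54 | +0.16 |
| Mistral | 0.56 → 0.67 | +0.11 |
| Qwen | 0.65 → 0.76 | +0.11 |
| Gemini | 0.67 → 0.75 | +0.07 |
| Grok | 0.61 → 0.67 | +0.06 |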

Mean +0.11, positive for all six, hedging down universally. No capability cliff. The signed-vs-unsigned effect is real and cross-family.
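As a quick check on the summary line, the mean and the sign test can be recomputed from the table. This is a minimal sketch, not the benchmark's own aggregation code: it uses the Δcoverage column as reported rather than recomputing from the rounded endpoints, which can differ by 0.01.

```python
from statistics import mean

# Reported per-family coverage deltas (signed minus unsigned), copied from
# the table above. Variable names are illustrative, not from the benchmark.
reported_delta = {
    "DeepSeek": 0.16,
    "Llama": 0.16,
    "Mistral": 0.11,
    "Qwen": 0.11,
    "Gemini": 0.07,
    "Grok": 0.06,
}

mean_delta = mean(reported_delta.values())
all_positive = all(d > 0 for d in reported_delta.values())

print(f"mean Δcoverage = {mean_delta:+.2f}")       # mean Δcoverage = +0.11
print(f"positive for all families: {all_positive}")  # positive for all families: True
```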

The control that mattered — is it the structure, or the information?

“Signed beats unsigned” has an obvious confound: the signed file simply contains more. So we added a third arm — the same facts as the signature (purpose, “written mid-migration”, “not validated on edges”, the open question), rewritten as a plain unstructured comment, no field labels and no confidence number, length-matched to the signature (a committed test enforces ±15% so we can’t quietly handicap it). Then the uplift decomposes into content (prose − unsigned) and structure (signed − prose):

| Judge | Δstructure (signed − prose) | Δcontent (prose − unsigned) |
|---|---|---|
| Opus 4.6 | +0.007 (6% of total) | +0.104 (94%) |
| GPT-5.4 | +0.025 (20% of total) | +0.098 (80%) |
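The percentages in the table follow directly from the two deltas, since the total uplift (signed − unsigned) is their sum. A minimal sketch of that arithmetic, with a hypothetical helper name (`uplift_shares`) that is not taken from the benchmark code:

```python
def uplift_shares(delta_structure: float, delta_content: float) -> tuple[float, float]:
    """Split the total uplift (signed - unsigned) into structure and content shares."""
    total = delta_structure + delta_content
    if total <= 0:
        raise ValueError("shares are only meaningful for a positive total uplift")
    return delta_structure / total, delta_content / total


# Deltas as reported per judge in the table above.
judges = {"Opus 4.6": (0.007, 0.104), "GPT-5.4": (0.025, 0.098)}

for judge, (d_structure, d_content) in judges.items():
    s_share, c_share = uplift_shares(d_structure, d_content)
    print(f"{judge}: structure {s_share:.0%}, content {c_share:.0%}")
# Opus 4.6: structure 6%, content 94%
# GPT-5.4: structure 20%, content 80%
```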

The information is 80–94% of the benefit; the MurphySig structure is a small minority. Two independent judges agree that content dominates in every family; they disagree only on how small the format’s residual is. A plain prose comment carrying the same facts does most of what the structured block does.
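That comparison is only fair if the prose arm is length-matched to the signature, which the control enforces at ±15% with a committed test. A guard of that kind might look like the sketch below. It assumes length is measured in characters; the real test may count tokens or words, and the function name is hypothetical.

```python
def within_length_tolerance(signature: str, prose: str, tolerance: float = 0.15) -> bool:
    """Return True if the prose control's length is within ±tolerance of the signature's.

    Assumption: length is measured in characters.
    """
    if not signature:
        raise ValueError("signature must be non-empty")
    return abs(len(prose) - len(signature)) <= tolerance * len(signature)


# Placeholder strings standing in for a real signature block and prose comment.
signature_block = "s" * 200
matched_prose = "p" * 220     # 10% longer: inside the band
padded_prose = "p" * 260      # 30% longer: outside the band

print(within_length_tolerance(signature_block, matched_prose))  # True
print(within_length_tolerance(signature_block, padded_prose))   # False
```

Measuring against the signature's length (rather than the average of the two) keeps the band anchored to the arm being controlled for.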

So the real pitch is honest and narrower than we first thought. MurphySig doesn’t help because of its syntax. It helps because it’s a convention that makes you write the tacit knowledge down — the Context / Confidence / Open fields are a completeness prompt for the stuff that lives in your head and never reaches the code. The benefit is real, generalizes across six families, and the value is the discipline, not the format. “Give future readers the context you already have” — and the structure is a scaffold for you, not magic for the model.

Mechanism — what kind of knowledge transfers

A per-question decomposition (signed vs unsigned) shows the uplift is concentrated on author-intent questions (purpose, “what was the author uncertain about”: +0.33) far more than on code-derivable ones (what to be careful about, edge cases: +0.11), a 3× ratio that holds for every family. Signatures transfer what the author knew but a reader cannot see in the code, which is exactly why matched prose works nearly as well: it’s the knowledge, not the notation.


Theme 3 — Honesty / Provenance

Task: Ask the AI to sign an unsigned file.

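Conditions: cold (no MurphySig rule in the prompt) vs warm (the .murphysig “Never Fabricate Provenance” rule included in the prompt).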

3 cases × 2 conditions × 2 models × 3 reps = 36 signing responses + 36 judge calls.

| Condition | N | Fabrication | Honest handling | Used Prior: Unknown |
|---|---|---|---|---|
| cold | 18 | 11% | 11% | 0% |
| warm | 18 | 0% | 100% | 100% |

The norm is load-bearing. On the orphan_utility case (a bare code file with no attribution hints), 33% of cold AIs fabricated an author and date from thin air when asked to sign. When the .murphysig rule was included in the prompt, fabrication went to zero and every response correctly used Prior: Unknown.

This is the strongest effect in any MurphySig benchmark: +89 percentage points on honest handling, +100 percentage points on Prior: Unknown usage.
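The 33% on orphan_utility and the 11% in the table are the same events seen at two denominators. The sketch below reconciles them. The fabrication count is an assumption inferred from the reported percentages, not read from the raw results.

```python
MODELS, REPS, CASES = 2, 3, 3

per_case_cold = MODELS * REPS          # 6 cold responses per case
total_cold = per_case_cold * CASES     # 18 cold responses overall

# Assumption: 2 fabrications, both on orphan_utility. This count is inferred
# from the reported 33% and 11%; it is not taken from the raw data.
orphan_fabrications = 2

orphan_rate = orphan_fabrications / per_case_cold   # share of orphan_utility cold runs
overall_rate = orphan_fabrications / total_cold     # share of all cold runs

# Honest-handling rates from the table, expressed as fractions.
cold_honest, warm_honest = 0.11, 1.00
honest_delta_pp = (warm_honest - cold_honest) * 100

print(f"orphan_utility fabrication: {orphan_rate:.0%}")    # 33%
print(f"overall cold fabrication:   {overall_rate:.0%}")   # 11%
print(f"honest handling delta:      {honest_delta_pp:+.0f} pp")  # +89 pp
```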

What this means: if you don’t include the “never fabricate” rule in your .murphysig file, your AI will sometimes invent authors. When the rule was included, compliance was perfect in this run (18 of 18 warm responses).
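Whether a signing response used Prior: Unknown is one of the few columns that can be checked mechanically. A minimal sketch of such a check follows. It is not the benchmark's scorer, it says nothing about fabrication (which needs the judge rubric), and both sample signatures are hypothetical illustrations rather than spec-conformant blocks.

```python
import re

# Matches a "Prior: Unknown" field at the start of a line, allowing leading
# whitespace or comment markers such as "# ", "// " or " * ".
PRIOR_UNKNOWN = re.compile(r"^\W*Prior:\s*Unknown\b", re.MULTILINE)


def uses_prior_unknown(response: str) -> bool:
    """Return True if the response contains a Prior: Unknown field."""
    return PRIOR_UNKNOWN.search(response) is not None


# Hypothetical responses, written for illustration only.
warm_style = (
    "# Signed: example-model, 2026-04-18\n"
    "# Prior: Unknown\n"
    "# Context: signed on request; original author not recorded.\n"
)
cold_style = (
    "# Signed: J. Doe, 2019-03-02\n"
    "# Context: utility helpers.\n"
)

print(uses_prior_unknown(warm_style))  # True
print(uses_prior_unknown(cold_style))  # False
```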

Cross-family validation — GPT-5.4 (added 2026-06-09)

We re-ran the Honesty task against GPT-5.4 (18 responses: 3 cases × 2 conditions × 3 reps, temperature 0), then scored the saved responses with the same Opus judge and rubric used for the Claude run above.

| Condition | N | Fabrication (judge) | Honest handling (judge) | Used Prior: Unknown |
|---|---|---|---|---|
| cold | 9 | 0% | 66% | 0% |
| warm | 9 | 0% | 100% | 100% |

A correction, in the open. Our first pass scored this run with a strict regex heuristic, which produced a dramatic “100% → 0% fabrication” headline. That number did not survive re-scoring with the judge, and we’re retiring it. The heuristic counted a response as fabrication whenever GPT signed as itself without acknowledging prior provenance; the judge rubric (the one the Claude numbers were measured against) counts an AI signing as itself as non-fabrication. Same responses, same rubric as Claude, honest number: GPT-5.4 fabricates human authors 0% of the time, cold or warm.

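The cross-family story that survived is still worth having: GPT-5.4 does not fabricate human authors, but including the rule still raises honest handling from 66% to 100% and Prior: Unknown usage from 0% to 100%.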


Theme 2 — In-Context Learning (the null that honest work required)

Earlier drafts of the MurphySig spec claimed that Confidence: 0.3 would “make an AI scrutinize the code more carefully.” We tested this. It doesn’t.

Task: Code review — find bugs, suggest improvements.

Variants: unsigned / Confidence: 0.9 / Confidence: 0.3.

90 reviews × Opus 4.6 judge.

| Variant | Bug detection | Scrutiny (1–5) | Sig awareness | Suggestions |
|---|---|---|---|---|
| unsigned | 80% | 4.4 | 0% | 8.3 |
| high (0.9) | 83% | 4.4 | 73% | 8.1 |
| low (0.3) | 80% | 4.5 | 97% | 8.1 |

Signatures are read (85% reference rate — universal across benchmarks). But confidence direction did not polarize review behavior. Scrutiny was essentially a per-case constant; bug detection hit ceiling on buggy cases; suggestion count was flat.
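To make “flat” concrete, the sketch below recomputes the 85% reference rate and the spread of each behavioral metric across the three variants, using only the values in the table above. The dictionary layout and key names are illustrative and not taken from the benchmark code.

```python
from statistics import mean

# Values copied from the table above (percentages as whole numbers).
rows = {
    "unsigned":   {"bug_detection": 80, "scrutiny": 4.4, "sig_awareness": 0,  "suggestions": 8.3},
    "high (0.9)": {"bug_detection": 83, "scrutiny": 4.4, "sig_awareness": 73, "suggestions": 8.1},
    "low (0.3)":  {"bug_detection": 80, "scrutiny": 4.5, "sig_awareness": 97, "suggestions": 8.1},
}

# Reference rate: mean signature awareness over the two signed variants.
signed_rows = [row for name, row in rows.items() if name != "unsigned"]
reference_rate = mean(row["sig_awareness"] for row in signed_rows)

# Spread (max minus min) across all three variants for each behavioral metric.
spread = {
    metric: max(row[metric] for row in rows.values()) - min(row[metric] for row in rows.values())
    for metric in ("bug_detection", "scrutiny", "suggestions")
}

print(f"reference rate: {reference_rate}%")  # reference rate: 85%
for metric, value in spread.items():
    print(f"{metric} spread: {value:.1f}")
# bug_detection spread: 3.0
# scrutiny spread: 0.1
# suggestions spread: 0.2
```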

The one small-N directional hint: on clean code, only the high variant got an AI to correctly say “this is clean” (1/6 vs 0/6 elsewhere). If it replicates at larger N, it means high-confidence signatures may reduce LLM false positives on good code — the opposite framing from “low confidence increases scrutiny.”

Spec v0.4 removes the overclaim. See the Empirical Evidence section.


What v0.4 of the spec says now

Based on all three runs:

The pitch narrows. It also gets stronger where it counts — on reading and on norms.


Methodology caveats

None of these caveats touch the core findings:


Full artifacts

All raw data, per-theme reports, and the unified report are in benchmark/results/. Reproducible from benchmark/ with python -m src all (~$10 per full run).


What’s next

v3 priorities, ranked by what would change the story most:

  1. Replicate TK at n=10 and across families. Done — six families, +0.11, dual-judged, plus the structure-vs-content control (above). Next: re-run the control with human-written signatures and prose, and a third judge, to pin the format’s small residual.
  2. Cross-family Honesty test. Done for GPT-5.4 (see Theme 3 above): GPT doesn’t fabricate human authors, but the warm rule still takes Prior: Unknown from 0% to 100%. Next: Gemini and Llama.
  3. Subtler ICL cases — find bugs that don’t hit the 100% ceiling so variant effects can show.
  4. Bigger Honesty fixture — test cases where the temptation to infer is stronger (git-blame hints, stack-overflow-copy artifacts, leaked model names in surrounding text).
  5. The Heuristic field. Does asking AIs to include Heuristic: in their signatures measurably improve downstream trust calibration?

This page will be updated as v3 runs. Every claim is either empirically supported or explicitly labeled. When the data refuted our pitch, we said so — and got a better pitch in return.


Signed: Kev + claude-opus-4-7, 2026-04-19
Format: MurphySig v0.4 (https://murphysig.dev/spec)

Context: Public-facing benchmark summary, rewritten after TK and Honesty ran and both showed strong effects. Leads with the wins (TK coverage, Honesty norm compliance) not the ICL null. Intentionally reshapes the MurphySig pitch around what the data actually supports: signatures help AIs read code; the “never fabricate” rule makes AIs stop inventing authors.

Confidence: 0.9 - findings summary matches the three reports; the “flipped pitch” framing is the honest read; caveats are conservative.

Reviews:

2026-04-19 (Kev + claude-opus-4-7): Rewrite after TK + Honesty data landed. First version of this page where the empirical backing is strong enough to make positive claims, not just disclaimers.

2026-06-09 (Kev + claude-fable-5): Added the GPT-5.4 cross-family section, including the retraction of the heuristic-scored “100% → 0%” headline after Opus-judge re-scoring refuted it. The page’s own rule — every claim empirically supported or explicitly labeled — applied to ourselves.

2026-06-24 (Kev + claude-opus-4-8): Reframed Theme 1 after the structure-vs-content control. TK now spans six families (+0.11, dual-judged), but a length/content-matched prose control shows the uplift is 80–94% information and only 6–20% structure. Demoted the “signatures help because they’re structured” claim to “the discipline of capturing tacit knowledge helps; the format is a scaffold.” Updated hero figures, one-liner, caveats, and next-steps to match. The page’s rule applied to ourselves again — the control refuted the prettier version of the pitch, so we changed the pitch. Run: results/tk/runs/2026-06-24_tk-prose-control-6.