Calibration

What the pipeline actually produced when run on Linux mainline at the snapshot recorded in each dossier’s meta.json.

Verdict distribution (top-1000 sweep)

verdict	count	%
`keep-annotate`	401	46%
`keep`	343	40%
`not-a-driver`	68	8%
`deprecate`	43	5%
`remove`	9	1%
`unsure`	0	0%

864 total dossiers (858 from the top-1000 shortlist plus 6 from earlier calibration runs).

The shape is what we want from a conservative pipeline:

~6% of probed drivers flagged for any change (52 actionable)
remove (1%) reserved for the strongest signal — model found an active upstream removal patch series
unsure is empty because the prompt’s confidence-cap rule diverts low-confidence answers into not-a-driver (when the evidence says it’s content) or keep-annotate (when the evidence is ambiguous)
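
The percentages and the actionable share follow directly from the counts in the table. A minimal sketch of that arithmetic, assuming "actionable" means deprecate plus remove (the names `VERDICT_COUNTS`, `ACTIONABLE`, and `summarize` are illustrative, not part of the pipeline):

```python
# Counts copied from the verdict table above.
VERDICT_COUNTS = {
    "keep-annotate": 401,
    "keep": 343,
    "not-a-driver": 68,
    "deprecate": 43,
    "remove": 9,
    "unsure": 0,
}

# Assumption: a verdict is "actionable" if it proposes a change to the tree.
ACTIONABLE = ("deprecate", "remove")


def summarize(counts):
    """Return (total, per-verdict share in percent, actionable count)."""
    total = sum(counts.values())
    shares = {verdict: 100.0 * n / total for verdict, n in counts.items()}
    actionable = sum(counts[verdict] for verdict in ACTIONABLE)
    return total, shares, actionable


total, shares, actionable = summarize(VERDICT_COUNTS)
for verdict, pct in shares.items():
    print(f"{verdict:14s} {VERDICT_COUNTS[verdict]:4d} {pct:5.1f}%")
print(f"total={total} actionable={actionable} ({100.0 * actionable / total:.1f}%)")
```

Rounded to whole percent, the shares match the table, and the actionable count matches the 52 quoted in the list.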

URL fidelity

Across all 3,552 URLs cited in the 864 dossiers, and in a focused spot-check of 159 URLs from the deprecate/remove subset, the pattern was consistent. The rows below account for 150 of the 159 spot-checked URLs; percentages are relative to 159.

status	count (in deprecate/remove subset)	interpretation
200 OK	135 (85%)	real and reachable
403	10 (6%)	bot-blocked (Anubis on lore.kernel.org, Cloudflare on vendor sites) — real, would render in a browser
000	3 (2%)	connection-level failures to real sites (intermittent SSL, ftp endpoints)
500	1 (0.6%)	transient lore.kernel.org 5xx
404	1 (0.6%)	silan.com.cn vendor site is dead — itself a deprecation signal

Zero genuine fabrications. When the model can’t cite real URLs, the prompt forces sources: [] and confidence ≤ 0.3.

The validator script scripts/spot_check.py reproduces this check at any time. Scaling it across all 3,552 URLs takes ~5 minutes with parallelism.
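
The bucketing behind the status table can be sketched as follows. This is not the code of scripts/spot_check.py, whose internals are not shown here; `classify_status`, `tally`, and the bucket names are hypothetical, and status 0 stands in for the "000" that curl-style tools report when no HTTP response arrives.

```python
from collections import Counter


def classify_status(code):
    """Map an HTTP status code (0 = no response) to an interpretation bucket."""
    if code == 200:
        return "reachable"
    if code == 403:
        return "bot-blocked"  # real page behind a bot wall
    if code == 0:
        return "connection-failure"  # TLS or transport error, no HTTP reply
    if code == 404:
        return "dead-link"
    if 500 <= code < 600:
        return "server-error"
    return "other"


def tally(codes):
    """Count how many status codes fall into each bucket."""
    return Counter(classify_status(code) for code in codes)


# Status codes as listed in the table above (150 itemized rows).
observed = [200] * 135 + [403] * 10 + [0] * 3 + [500] + [404]
buckets = tally(observed)
for name, count in buckets.most_common():
    print(f"{name:20s} {count}")
```

Keeping classification separate from fetching means the interpretation rules can be checked without network access.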

Cost (top-1000 sweep)

Empirical totals across 858 fresh probes:

metric	value
sum wall-clock	17.07 hours
input tokens	135,145,470
cached input	105,882,624 (78.3% hit rate)
output tokens	2,232,420
real probe avg	76.7 s
not-a-driver early-exit avg	12.8 s
total estimated cost	~$240
per actionable verdict	~$4.70
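
The derived figures in this table can be rechecked from the raw totals. A rough sketch, with two assumptions labeled in the comments: the per-actionable divisor is 51 (the actionable count quoted in the cross-check section), and all 68 not-a-driver verdicts took the early-exit path within the 858 probes.

```python
PROBES = 858
WALL_CLOCK_HOURS = 17.07
INPUT_TOKENS = 135_145_470
CACHED_INPUT_TOKENS = 105_882_624
TOTAL_COST_USD = 240.0  # approximate, as in the table

# Assumption: divisor implied by the ~$4.70 figure.
ACTIONABLE_IN_SWEEP = 51
# Assumption: every not-a-driver verdict was an early exit in this sweep.
EARLY_EXITS = 68

cache_hit_rate = CACHED_INPUT_TOKENS / INPUT_TOKENS
avg_seconds_per_probe = WALL_CLOCK_HOURS * 3600 / PROBES
cost_per_actionable = TOTAL_COST_USD / ACTIONABLE_IN_SWEEP
cost_per_probe = TOTAL_COST_USD / PROBES

# Cross-check: rebuild the wall-clock total from the two per-probe averages.
rebuilt_hours = ((PROBES - EARLY_EXITS) * 76.7 + EARLY_EXITS * 12.8) / 3600

print(f"cache hit rate      {cache_hit_rate:.1%}")
print(f"avg per probe       {avg_seconds_per_probe:.1f} s")
print(f"cost per actionable ${cost_per_actionable:.2f}")
print(f"cost per probe      ${cost_per_probe:.2f}")
print(f"rebuilt wall-clock  {rebuilt_hours:.2f} h")
```

Under those assumptions the rebuilt wall-clock lands close to the reported 17.07 hours, which suggests the two averages and the total are mutually consistent.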

Concurrency caveat: this is single-threaded. A 4-way asyncio-gather runner would drop wall-clock to ~4 hours without significantly increasing cost (cache hit rate stays high across parallel calls).
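
A bounded-concurrency runner of the kind described could look like the sketch below. It is an illustration only: `run_probes` and `fake_probe` are hypothetical names, and `fake_probe` stands in for whatever coroutine performs one real probe.

```python
import asyncio


async def run_probes(targets, probe, limit=4):
    """Run probe(target) for every target, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def guarded(target):
        # The semaphore caps how many probes are in flight at once.
        async with sem:
            return await probe(target)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(t) for t in targets))


async def fake_probe(target):
    # Hypothetical stand-in for one LLM probe of a driver directory.
    await asyncio.sleep(0.01)
    return (target, "keep")


results = asyncio.run(
    run_probes(["drivers/a", "drivers/b", "drivers/c"], fake_probe)
)
print(results)
```

The semaphore, rather than fixed batches of four, keeps four probes in flight even when individual probes differ widely in duration, as the 76.7 s and 12.8 s averages suggest they do.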

Where the evidence comes from

Tool-call breakdown across the 858 probes:

tool family	calls	failures	failure cause
MCP `lore-http`	~2,500	~1,000 (40%)	BM25 inconsistency on the lore-http server (known infrastructure issue)
shell (rg, sed, lei, …)	~3,000	~100 (3%)	mostly `rg` on missing paths, harmless
web_search	~2,500	low	mostly successful

Despite the lore_search failure rate, the surviving lore_activity and lore_file_timeline calls produced the strongest evidence in the corpus: two of the remove verdicts were driven by lore_file_timeline discovering active removal patch series dated 2026-04-22.

Subsystem coverage

The 858 probed dirs span 122 distinct top-level subsystems:

top-15 subsystems	dirs probed
drivers/net	188
drivers/media	102
drivers/gpu	57
drivers/clk	32
drivers/scsi	32
drivers/crypto	29
drivers/pinctrl	26
drivers/iio	25
drivers/phy	23
drivers/infiniband	22
drivers/soc	21
drivers/video	19
drivers/misc	17
drivers/dma	15
drivers/platform	11

The top 15 account for 619 of the 858 probed dirs. Among the other 107 subsystems, 75 have 1-3 probes each. This is the long-tail legacy infrastructure (rapidio, ipack, hsi, parport, sbus, ps3, ssb, bcma, isdn, etc.) where deprecation candidates concentrate.

Diminishing returns past rank 500

Extending the sweep from the top 500 to the top 1000 added no deprecate verdicts: the count stayed at 42. The only new actionable verdicts in ranks 504-858 were 3 remove verdicts (caif, hamradio, isdn/mISDN). Signal density is concentrated in the top ~500.

This is the empirical justification for the top-1000 cutoff. Going to top-2000 would burn ~$140 more for at most a few additional deprecates.

Confidence-vs-verdict shape

Among the 9 remove dossiers:

5 have confidence ≥ 0.90, all backed by lore patches
3 have confidence in 0.83-0.89, backed by mixed lore + web evidence
1 has confidence 0.78 (the weakest remove — the model ranked the older mISDN modular driver alongside the stronger isdn/hardware/mISDN evidence)

Among the 43 deprecate dossiers:

15 have confidence ≥ 0.80, the strongest tier
22 are in 0.70-0.80
6 are in 0.60-0.70 (the model is hedging — the dossier explicitly says “deployment is plausible in $niche but evidence is thin”)
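
The tier boundaries above overlap at 0.70 and 0.80 as written. One way to make the binning unambiguous is half-open intervals, sketched here; the choice to put a boundary value in the higher tier is an assumption, and `tier` is a hypothetical helper. The sample confidences are the three deprecate verdicts from the cross-check table.

```python
from collections import Counter


def tier(conf):
    """Assign a confidence to a tier; boundary values go to the higher tier."""
    if conf >= 0.80:
        return "strong (>= 0.80)"
    if conf >= 0.70:
        return "middle (0.70 to < 0.80)"
    if conf >= 0.60:
        return "hedged (0.60 to < 0.70)"
    return "below 0.60"


# Deprecate verdicts and confidences from the cross-check table.
sample = {
    "net/ethernet/qlogic": 0.82,
    "net/ethernet/ibm/ehea": 0.82,
    "misc/c2port": 0.66,
}

tiers = Counter(tier(conf) for conf in sample.values())
for name, count in tiers.items():
    print(f"{name:26s} {count}")
```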

Among keep-annotate:

median confidence ~0.78
the 401 in this bucket are mostly “old hardware, plausible niche use, no active maintenance, but no strong removal case either”

Independent cross-checks

Five high-leverage verdicts spot-verified by hand:

driver	verdict	conf	manual check
net/ethernet/qlogic	deprecate	0.82	confirmed: dir root is qla3xxx.c (legacy 2006); active qed/qede are own leaves at lower scores
net/ethernet/ibm/ehea	deprecate	0.82	confirmed: dossier names ibmveth as replacement; IBM Power11 docs cited explicitly say HEA unsupported
misc/c2port	deprecate	0.66	confirmed: appropriately hedged — Silicon Labs still documents C2 protocol, model flagged niche-use risk
net/ethernet/fujitsu	remove	0.95	confirmed via lore MCP: real Andrew Lunn patch series, 1213 lines deleted, dated 2026-04-22
net/ethernet/packetengines	remove	0.95	confirmed via lore MCP: real Xidian student patch series, 2026-04-22

Zero of five verdicts were overturned. Caveat: spot-checks are not random sampling. The actionable subset (51 drivers from the sweep, 52 across all 864 dossiers) is small enough to be auditable individually before any disclosure.

Re-run reproducibility

The pipeline is reproducible-ish:

Phase 1 (no LLM) is bit-identical given the same kernel SHA, ref, and since date.
Phase 2 (LLM) is approximately reproducible — re-running the same prompt with the same model and effort produces verdicts that match in ~95% of cases (rough estimate from spot-checks). Source URLs cited can vary from run to run; the verdict and confidence track closely.
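
The agreement rate between two Phase 2 runs can be measured with a comparison like the one below. This is a sketch under stated assumptions: each run is reduced to a mapping from driver path to verdict, `verdict_agreement` is a hypothetical helper, and the second run's verdicts are invented purely to exercise the function.

```python
def verdict_agreement(run_a, run_b):
    """Fraction of drivers present in both runs whose verdicts match."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        raise ValueError("runs share no drivers")
    matches = sum(1 for driver in shared if run_a[driver] == run_b[driver])
    return matches / len(shared)


# First run: verdicts from the cross-check table.
run_a = {
    "net/ethernet/qlogic": "deprecate",
    "misc/c2port": "deprecate",
    "net/ethernet/fujitsu": "remove",
    "net/ethernet/packetengines": "remove",
}

# Second run: hypothetical, with one verdict flipped for illustration.
run_b = dict(run_a, **{"misc/c2port": "keep-annotate"})

agreement = verdict_agreement(run_a, run_b)
print(f"verdict agreement: {agreement:.0%}")
```

Comparing only the verdict field, not the cited URLs, matches the observation that sources vary between runs while verdicts track closely.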

Each meta.json records the model + reasoning effort + kernel SHA used, so corpora built across multiple model versions can be audited.