Calibration

What the pipeline actually produced when run on Linux mainline at the snapshot recorded in each dossier’s meta.json.

Verdict distribution (top-1000 sweep)

| verdict | count | % |
|---|---|---|
| keep-annotate | 401 | 46% |
| keep | 343 | 40% |
| not-a-driver | 68 | 8% |
| deprecate | 43 | 5% |
| remove | 9 | 1% |
| unsure | 0 | 0% |

864 total dossiers (858 from the top-1000 shortlist plus 6 from earlier calibration runs).

The shape is what we want from a conservative pipeline:

  • ~6% of probed drivers flagged for any change (52 actionable)
  • remove (1%) reserved for the strongest signal — model found an active upstream removal patch series
  • unsure is empty because the prompt’s confidence-cap rule diverts low-confidence answers into not-a-driver (when the evidence says it’s content) or keep-annotate (when the evidence is ambiguous)
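
The percentages and the actionable share follow directly from the verdict counts. A minimal sketch of that arithmetic, with the counts copied from the distribution table; the variable names are illustrative and not part of the pipeline:

```python
from collections import Counter

# Verdict counts copied from the distribution table.
verdicts = Counter({
    "keep-annotate": 401,
    "keep": 343,
    "not-a-driver": 68,
    "deprecate": 43,
    "remove": 9,
    "unsure": 0,
})

total = sum(verdicts.values())
# "Actionable" here means any verdict that proposes a change.
actionable = verdicts["deprecate"] + verdicts["remove"]

for name, count in verdicts.most_common():
    print(f"{name:14s} {count:4d} {100 * count / total:5.1f}%")
print(f"actionable: {actionable} of {total} ({100 * actionable / total:.1f}%)")
```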

URL fidelity

Across all 3,552 URLs cited in the 864 dossiers — and a focused spot-check of 159 URLs from the deprecate/remove subset — the pattern was consistent:

| status | count (in deprecate/remove subset) | interpretation |
|---|---|---|
| 200 OK | 135 (85%) | real and reachable |
| 403 | 10 (6%) | bot-blocked (Anubis on lore.kernel.org, Cloudflare on vendor sites); real, would render in a browser |
| 000 | 3 (2%) | connection-level failures to real sites (intermittent SSL, ftp endpoints) |
| 500 | 1 (0.6%) | transient lore.kernel.org 5xx |
| 404 | 1 (0.6%) | silan.com.cn vendor site is dead, itself a deprecation signal |

Zero genuine fabrications. When the model can’t cite real URLs, the prompt forces sources: [] and confidence ≤ 0.3.
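
The no-sources rule can be checked mechanically. The sketch below assumes a dossier is a dict with a `sources` list and a `confidence` float; the field names mirror the wording above, but the real dossier schema is not shown here, so treat the function and its name as hypothetical:

```python
def violates_source_rule(dossier: dict, cap: float = 0.3) -> bool:
    """Return True if a dossier cites no sources yet exceeds the confidence cap."""
    sources = dossier.get("sources", [])
    confidence = dossier.get("confidence", 0.0)
    return len(sources) == 0 and confidence > cap


# Hypothetical dossiers for illustration only.
ok_uncited = {"verdict": "keep-annotate", "sources": [], "confidence": 0.25}
bad_uncited = {"verdict": "deprecate", "sources": [], "confidence": 0.80}

print(violates_source_rule(ok_uncited))
print(violates_source_rule(bad_uncited))
```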

The validator script scripts/spot_check.py reproduces this check at any time. Scaling it across all 3,552 URLs takes ~5 minutes with parallelism.
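
For readers without the repository at hand, a parallel URL check of this kind might look roughly like the following. This is not the contents of scripts/spot_check.py; the function names, the worker count, the use of HEAD requests, and the bucket labels are all assumptions, and the sketch only handles HTTP(S) URLs:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for url, or 0 on a connection-level failure."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # 4xx/5xx responses still carry a status code.
        return err.code
    except (OSError, ValueError):
        # DNS errors, TLS errors, timeouts, malformed URLs.
        return 0


def classify(status: int) -> str:
    """Map a status code to a bucket modelled on the table above (labels assumed)."""
    if status == 0:
        return "connection-failure"
    if 200 <= status < 300:
        return "reachable"
    if status == 403:
        return "bot-blocked"
    if status == 404:
        return "dead"
    if 500 <= status < 600:
        return "server-error"
    return "other"


def spot_check(urls, workers: int = 16) -> Counter:
    """Fetch all URLs in a thread pool and tally the resulting buckets."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(classify(s) for s in pool.map(fetch_status, urls))
```

A 403 is bucketed as bot-blocked here only because that is how the table interprets it; a stricter checker would confirm such URLs in a browser before counting them as real.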

Cost (top-1000 sweep)

Empirical totals across 858 fresh probes:

| metric | value |
|---|---|
| sum wall-clock | 17.07 hours |
| input tokens | 135,145,470 |
| cached input | 105,882,624 (78.3% hit rate) |
| output tokens | 2,232,420 |
| real probe avg | 76.7 s |
| not-a-driver early-exit avg | 12.8 s |
| total estimated cost | ~$240 |
| per actionable verdict | ~$4.70 |
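
Two of these figures can be re-derived from the raw totals. A small sketch of that arithmetic, using only numbers from the table (variable names are illustrative):

```python
# Totals copied from the cost table.
wall_clock_hours = 17.07
probes = 858
input_tokens = 135_145_470
cached_input_tokens = 105_882_624

# Share of input tokens served from cache.
cache_hit_rate = cached_input_tokens / input_tokens

# Mean wall-clock per probe across all probes, early exits included.
mean_seconds_per_probe = wall_clock_hours * 3600 / probes

print(f"cache hit rate: {100 * cache_hit_rate:.1f}%")
print(f"mean per probe: {mean_seconds_per_probe:.1f} s")
```

The blended mean lands between the real-probe average and the early-exit average, as expected when a minority of probes exit early.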

Concurrency caveat: these totals come from a single-threaded run. A 4-way asyncio.gather runner would cut wall-clock to roughly 4.3 hours (17.07 / 4) without significantly increasing cost, because the cache hit rate stays high across parallel calls.
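
A bounded-concurrency runner of that kind could be sketched as follows. `asyncio.gather` on its own starts every coroutine at once, so the 4-way limit here comes from an `asyncio.Semaphore`. The names `run_bounded` and `fake_probe` are hypothetical; the real probe call is not shown in this document:

```python
import asyncio


async def run_bounded(items, probe, limit: int = 4):
    """Run probe(item) for every item with at most `limit` probes in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        async with sem:
            return await probe(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_probe(path: str) -> str:
    """Stand-in for one LLM probe; sleeps briefly instead of calling a model."""
    await asyncio.sleep(0.01)
    return f"probed:{path}"


if __name__ == "__main__":
    dirs = ["drivers/net", "drivers/media", "drivers/gpu", "drivers/clk", "drivers/scsi"]
    print(asyncio.run(run_bounded(dirs, fake_probe)))
```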

Where the evidence comes from

Tool-call breakdown across the 858 probes:

| tool family | calls | failures | failure cause |
|---|---|---|---|
| MCP lore-http | ~2,500 | ~1,000 (40%) | BM25 inconsistency on the lore-http server (known infrastructure issue) |
| shell (rg, sed, lei, …) | ~3,000 | ~100 (3%) | mostly rg on missing paths, harmless |
| web_search | ~2,500 | low | mostly successful |

Despite the lore_search failure rate, the surviving lore_activity and lore_file_timeline calls produced the strongest evidence in the corpus: two of the remove verdicts were driven by lore_file_timeline discovering active removal patch series dated 2026-04-22.

Subsystem coverage

The 858 probed dirs span 122 distinct top-level subsystems:

| top-15 subsystems | dirs probed |
|---|---|
| drivers/net | 188 |
| drivers/media | 102 |
| drivers/gpu | 57 |
| drivers/clk | 32 |
| drivers/scsi | 32 |
| drivers/crypto | 29 |
| drivers/pinctrl | 26 |
| drivers/iio | 25 |
| drivers/phy | 23 |
| drivers/infiniband | 22 |
| drivers/soc | 21 |
| drivers/video | 19 |
| drivers/misc | 17 |
| drivers/dma | 15 |
| drivers/platform | 11 |

The remaining 75 subsystems have 1-3 probes each — the long-tail legacy infrastructure (rapidio, ipack, hsi, parport, sbus, ps3, ssb, bcma, isdn, etc.) where deprecation candidates concentrate.

Diminishing returns past rank 500

The deprecate count between top-500 and top-1000 was unchanged at 42. The model found exactly 3 new remove verdicts in ranks 504-858 (caif, hamradio, isdn/mISDN). Signal density is concentrated in the top ~500.

This is the empirical justification for the top-1000 cutoff. Going to top-2000 would burn ~$140 more for at most a few additional deprecates.
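
One way to inspect this kind of diminishing return is to count actionable verdicts under successive rank cutoffs. The sketch below uses a made-up ten-entry ranking purely for illustration; it is not the pipeline's data, and the function name is hypothetical:

```python
ACTIONABLE = {"deprecate", "remove"}


def actionable_by_cutoff(ranked_verdicts, cutoffs):
    """Count actionable verdicts within the top-N entries for each cutoff N.

    ranked_verdicts is a list of verdict strings in rank order (index 0 = rank 1).
    """
    return {
        n: sum(v in ACTIONABLE for v in ranked_verdicts[:n])
        for n in cutoffs
    }


# Toy ranking, invented for this example.
toy = ["remove", "keep", "deprecate", "keep", "keep",
       "keep-annotate", "keep", "deprecate", "keep", "keep"]

print(actionable_by_cutoff(toy, [5, 10]))
```

If the count barely moves between one cutoff and the next, extending the sweep further buys little additional signal.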

Confidence-vs-verdict shape

Among the 9 remove dossiers:

  • 5 have confidence ≥ 0.90, all backed by lore patches
  • 3 have confidence in 0.83-0.89, backed by mixed lore + web evidence
  • 1 has confidence 0.78 (the weakest remove — the model ranked the older mISDN modular driver alongside the stronger isdn/hardware/mISDN evidence)

Among the 43 deprecate dossiers:

  • 15 have confidence ≥ 0.80, the strongest tier
  • 22 are in 0.70-0.80
  • 6 are in 0.60-0.70 (the model is hedging — the dossier explicitly says “deployment is plausible in $niche but evidence is thin”)

Among keep-annotate:

  • median confidence ~0.78
  • the 401 in this bucket are mostly “old hardware, plausible niche use, no active maintenance, but no strong removal case either”
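
The tiers used in these breakdowns can be reproduced with a small bucketing helper. The sketch assumes half-open tiers (a confidence of exactly 0.80 counts in the top tier), which the text does not specify, and the sample values are invented for illustration:

```python
from collections import Counter
from statistics import median


def tier(confidence: float) -> str:
    """Assign a confidence value to a tier; lower bounds are inclusive (assumed)."""
    if confidence >= 0.80:
        return ">=0.80"
    if confidence >= 0.70:
        return "0.70-0.80"
    if confidence >= 0.60:
        return "0.60-0.70"
    return "<0.60"


# Hypothetical confidences, not taken from the dossiers.
sample = [0.82, 0.74, 0.66, 0.91, 0.78]

print(Counter(tier(c) for c in sample))
print("median:", median(sample))
```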

Independent cross-checks

Five high-leverage verdicts spot-verified by hand:

| driver | verdict | conf | manual check |
|---|---|---|---|
| net/ethernet/qlogic | deprecate | 0.82 | confirmed: dir root is qla3xxx.c (legacy 2006); active qed/qede are own leaves at lower scores |
| net/ethernet/ibm/ehea | deprecate | 0.82 | confirmed: dossier names ibmveth as replacement; IBM Power11 docs cited explicitly say HEA unsupported |
| misc/c2port | deprecate | 0.66 | confirmed: appropriately hedged; Silicon Labs still documents C2 protocol, model flagged niche-use risk |
| net/ethernet/fujitsu | remove | 0.95 | confirmed via lore MCP: real Andrew Lunn patch series, 1213 lines deleted, dated 2026-04-22 |
| net/ethernet/packetengines | remove | 0.95 | confirmed via lore MCP: real Xidian student patch series, 2026-04-22 |

Zero of five verdicts were overturned. Caveat: spot-checks are not random sampling. The actionable subset (51 drivers) is small enough to be auditable individually before any disclosure.

Re-run reproducibility

The pipeline is reproducible-ish:

  • Phase 1 (no LLM) is bit-identical given the same kernel SHA, ref, and since date.
  • Phase 2 (LLM) is approximately reproducible — re-running the same prompt with the same model and effort produces verdicts that match in ~95% of cases (rough estimate from spot-checks). Source URLs cited can vary from run to run; the verdict and confidence track closely.
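
The Phase 2 match rate can be measured rather than estimated by comparing verdicts across two runs. A minimal sketch, assuming each run is available as a mapping from driver directory to verdict string; that representation and the function name are assumptions, not the pipeline's actual format:

```python
def verdict_agreement(run_a: dict, run_b: dict) -> float:
    """Fraction of drivers present in both runs whose verdicts match."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        raise ValueError("runs have no drivers in common")
    matches = sum(run_a[d] == run_b[d] for d in shared)
    return matches / len(shared)


# Toy runs, invented for this example.
first = {"misc/c2port": "deprecate", "net/ethernet/fujitsu": "remove",
         "drivers/clk": "keep", "drivers/phy": "keep"}
second = {"misc/c2port": "keep-annotate", "net/ethernet/fujitsu": "remove",
          "drivers/clk": "keep", "drivers/phy": "keep"}

print(f"agreement: {verdict_agreement(first, second):.0%}")
```

Comparing verdicts only, and not cited URLs, matches the observation that sources vary between runs while verdicts track closely.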

Each meta.json records the model + reasoning effort + kernel SHA used, so corpora built across multiple model versions can be audited.