Limitations

What this corpus is, and is not.

1. Granularity is the directory, not the driver file

The unit of analysis is the leaf directory under drivers/. For directories that contain a single driver (e.g. drivers/net/ethernet/fujitsu/ has only fmvj18x_cs.c), the dossier accurately reflects that driver.

For directories containing multiple independent drivers, the dossier averages across them. Examples:

drivers/net/ethernet/3com/ contains six drivers: 3c509, 3c515, 3c574_cs, 3c589_cs, 3c59x, typhoon. Andrew Lunn’s 2026-04 removal series targets four (3c509, 3c515, 3c574_cs, 3c589_cs) but leaves 3c59x and typhoon. Our dossier reports remove at 0.92 confidence, which matches the direction of the campaign but cannot distinguish per-file fates within the directory.
drivers/net/ethernet/amd/ contains 14 drivers, of which only 2 (lance, nmclan_cs) are being removed in the same Lunn series. The dossier sees the dir as 86% active and reports keep-annotate — correct at the dir level, but the targeted removal of 2 specific drivers is not surfaced.
drivers/net/ethernet/8390/ has 17 drivers; 4 are being removed (axnet_cs, pcnet_cs, smc-ultra, wd). The dir-level dossier is keep-annotate; per-file granularity would yield 4× remove and 13× keep.

A future per-file pipeline (one dossier per .c file) would catch these. The current dir-level pipeline is a deliberate cost trade-off: running per-file would multiply cost by ~5× without changing the direction of most verdicts.
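To make the granularity gap concrete, the sketch below tabulates the per-driver split for the three example directories above. The counts come from the text; the dictionary layout and the `kept_fraction` helper are illustrative assumptions, not part of the pipeline.

```python
# Per-driver fates for the three example directories, taken from the text above.
# "remove" = targeted by the removal series, "keep" = left in place.
# The data layout is an assumption made for illustration only.
DIRS = {
    "drivers/net/ethernet/3com": {"remove": 4, "keep": 2},
    "drivers/net/ethernet/amd": {"remove": 2, "keep": 12},
    "drivers/net/ethernet/8390": {"remove": 4, "keep": 13},
}


def kept_fraction(counts):
    """Share of drivers in a directory that the removal series leaves alone."""
    total = counts["remove"] + counts["keep"]
    return counts["keep"] / total


for path, counts in DIRS.items():
    print(
        f"{path}: {counts['remove']} remove, {counts['keep']} keep "
        f"({kept_fraction(counts):.0%} kept)"
    )
```

A single dir-level verdict has to summarise each of these splits in one label, so the minority fate in every directory is the part that goes unreported.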

2. Mega-subsystem leaves are excluded

phase1_rank.py::MEGA_SUBSYSTEM_PREFIXES blocks any leaf directory inside:

drivers/gpu/drm/{amd,i915,xe,nouveau}/
drivers/net/wireless/{realtek/rtw88,realtek/rtw89,intel/iwlwifi,mediatek/mt76}/
drivers/net/ethernet/{intel,mellanox,broadcom}/
drivers/nvme/, drivers/scsi/megaraid/, drivers/md/dm
drivers/staging/

Drivers in these subtrees are virtually all keep, so the codex cost would not pay off. The blocklist is in source and editable. If you suspect a deprecation candidate inside one of these subtrees (e.g. drivers/gpu/drm/i915/display/dvo_ns2501 for an obscure DVO chip), remove the prefix and re-run.

drivers/staging/ is excluded specifically because it has its own deprecation/promotion process; producing dossier verdicts there would conflict with upstream staging/ maintainers’ explicit decisions.
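A prefix blocklist of this kind can be sketched as follows. This is a minimal illustration, not the code in phase1_rank.py: the tuple below is a shortened, hypothetical excerpt, and `is_blocked` is an assumed helper name.

```python
# Hypothetical excerpt; the real MEGA_SUBSYSTEM_PREFIXES in phase1_rank.py
# may be structured differently and contains more entries.
MEGA_SUBSYSTEM_PREFIXES = (
    "drivers/gpu/drm/i915/",
    "drivers/net/ethernet/intel/",
    "drivers/nvme/",
    "drivers/staging/",
)


def is_blocked(leaf_dir, prefixes=MEGA_SUBSYSTEM_PREFIXES):
    """Return True if leaf_dir sits inside any blocked subtree."""
    # Normalise to a trailing slash so "drivers/nvme" matches "drivers/nvme/"
    # while the unrelated "drivers/nvmem" does not.
    path = leaf_dir.rstrip("/") + "/"
    return any(path.startswith(prefix) for prefix in prefixes)


# Unblocking one subtree amounts to dropping its prefix before ranking.
without_i915 = tuple(p for p in MEGA_SUBSYSTEM_PREFIXES if "i915" not in p)
print(is_blocked("drivers/gpu/drm/i915/display"))
print(is_blocked("drivers/gpu/drm/i915/display", without_i915))
```

Matching on a slash-terminated prefix is a design choice that avoids accidentally excluding sibling directories whose names merely share leading characters.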

3. Snapshot, not a feed

The corpus is a one-shot snapshot. Each dossier records the kernel HEAD SHA in its meta.json. Re-running picks up:

new drivers added since the snapshot
drivers whose status has changed (removal patches landing, new fixes appearing, syzbot reports, etc.)
drivers removed entirely (the dossier is left in place; a follow-up sweep should reconcile)

Empirically, dormancy state changes slowly — a quarterly re-run catches most state changes. Daily/weekly re-runs are wasted budget.
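The follow-up sweep mentioned above could start from a plain set comparison between the directories the snapshot covers and the leaf directories in a newer checkout. The function name and the returned keys are assumptions for illustration; the corpus does not ship this helper.

```python
def reconcile(dossier_dirs, current_dirs):
    """Compare directories covered by the snapshot with a newer checkout."""
    dossier_dirs = set(dossier_dirs)
    current_dirs = set(current_dirs)
    return {
        # Present in the tree but never probed: needs a first dossier.
        "new": sorted(current_dirs - dossier_dirs),
        # Dossier exists but the directory is gone upstream.
        "orphaned": sorted(dossier_dirs - current_dirs),
        # Still present: status may have changed since the snapshot.
        "recheck": sorted(dossier_dirs & current_dirs),
    }


# "drivers/example/newchip" is a made-up path standing in for a new driver.
snapshot = ["drivers/net/ethernet/fujitsu", "drivers/pinctrl/visconti"]
checkout = ["drivers/pinctrl/visconti", "drivers/example/newchip"]
print(reconcile(snapshot, checkout))
```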

4. Phase-1 dormancy is one signal, not a diagnosis

A directory at the top of the dormancy ranking is a “candidate worth probing”, not a “deprecate me” signal. Many high-dormancy directories turn out to be keep-annotate or even keep once the phase-2 dossier runs:

drivers/pinctrl/visconti — quiet because the Toshiba TMPV7708 SoC is a niche automotive part with stable hardware, but the driver works correctly and is supported by Toshiba
drivers/soc/lantiq — quiet because Lantiq xDSL chipsets are long-stabilised, but OpenWrt 24.10.2 still ships images for the platform

The model’s job in phase 2 is exactly to discriminate these from genuine deprecation candidates.

5. Lore search is fragile

The lore-http MCP server’s lore_search (BM25 fused with trigram) returns errors ~40% of the time on the current corpus because of a known infrastructure issue (the BM25 index generation lags behind the corpus). The model handles this gracefully by falling back to lore_regex or lore_file_timeline, but in a small fraction of cases it gives up on lore evidence and relies on web search alone, which weakens the dossier.

This is an upstream-tool-health issue, not a methodology issue. Once BM25 is rebuilt, dossier quality should marginally improve on a re-run.
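The fallback behaviour described above is carried out by the model, not by a script, but the same ordering can be expressed as a small helper for anyone wrapping the tools programmatically. The tool names come from the text; the callable signatures and the stub functions are assumptions.

```python
def first_successful(query, tools):
    """Try each (name, callable) in order.

    Returns (name, result, errors): the first tool that succeeds and its
    result, plus the errors raised by the tools tried before it. If every
    tool fails, name and result are None.
    """
    errors = {}
    for name, tool in tools:
        try:
            return name, tool(query), errors
        except Exception as exc:  # a real client would catch a narrower type
            errors[name] = exc
    return None, None, errors


# Stand-in stubs; the real MCP tools are remote calls with their own schemas.
def lore_search(query):
    raise RuntimeError("BM25 index unavailable")


def lore_regex(query):
    return [f"thread matching {query!r}"]


name, hits, errors = first_successful(
    "3c509", [("lore_search", lore_search), ("lore_regex", lore_regex)]
)
print(name, hits, list(errors))
```

Recording which tool finally answered makes it possible to see afterwards how many dossiers were built without the primary search.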

6. Web search recall is uneven

The model’s web search works well for:

vendor product pages (Marvell, Broadcom, IBM, etc.)
distro package indexes (cateee LKDDb, Debian / Fedora pkgs)
canonical references (Wikipedia, kernel docs)
virtualisation guest docs (QEMU, VMware, VirtualBox)
hobbyist communities (OpenWrt, postmarketOS, Maemo wikis)

It is weaker for:

non-English vendor sites (often not indexed)
forum threads and mailing-list archives outside lore.kernel.org
archived / historical pages that have been moved without redirects
regional retail sites (would tell us whether hardware is still sold in a specific country)

Where web search is weak, dossier confidence drops naturally — the model’s prompt forces low confidence when sources are thin.

7. The recommendation is not a patch

A remove dossier means: there is enough evidence to take the question seriously. It is not a sanctioned upstream patch. Going from a remove dossier to a sent patch requires:

per-driver maintainer review
a Cc: stable@vger.kernel.org discussion if any backports matter
adherence to the project’s disclosure / cleanup conventions

The corpus is seed material for those conversations, not a substitute for them.

8. The model can be wrong

Spot-checks have found zero overturned verdicts so far across ~5 hand-verified deprecate/remove cases, but this is a small sample. At larger scale, expect:

the occasional false positive deprecate on a driver that’s quietly fine (e.g., a stable industrial driver where the vendor’s online presence has shrunk but deployments are real)
the occasional false negative keep on a driver that has silently rotted (no obvious removal patch yet, but no real users either)

Treat dossiers with confidence < 0.7 as candidates for human review rather than automation inputs.
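The 0.7 cut-off can be applied mechanically before any dossier reaches an automated consumer. The sketch below assumes each dossier is a dict with `path`, `verdict` and `confidence` keys; those field names are hypothetical, and the second sample record is invented for the example.

```python
REVIEW_THRESHOLD = 0.7


def triage(dossiers, threshold=REVIEW_THRESHOLD):
    """Split dossiers into (needs_human_review, safe_for_automation)."""
    review, automate = [], []
    for dossier in dossiers:
        if dossier["confidence"] < threshold:
            review.append(dossier)
        else:
            automate.append(dossier)
    return review, automate


sample = [
    # Verdict and confidence for 3com are the values quoted earlier in this section.
    {"path": "drivers/net/ethernet/3com", "verdict": "remove", "confidence": 0.92},
    # Invented record to show the low-confidence branch.
    {"path": "drivers/example/thin-sources", "verdict": "deprecate", "confidence": 0.55},
]
review, automate = triage(sample)
print([d["path"] for d in review])
print([d["path"] for d in automate])
```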

9. URL freshness

URLs cited in dossiers are timestamped at the moment of probing. A vendor EOL page or distro config link can disappear or move later. The validator script (spot_check.py) HEAD-resolves URLs on demand, but the corpus itself doesn’t track per-URL expiration. For a public-facing site, periodic re-validation is worth scheduling.
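For scheduled re-validation, a HEAD check along the following lines may be enough. It is a standard-library sketch and does not reproduce spot_check.py; the `head_status` name and the injectable `opener` parameter are assumptions, the latter added so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def head_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status of a HEAD request, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 410.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        return None
```

Some servers reject HEAD while serving GET normally, so a non-200 result is better treated as a prompt to look at the URL by hand than as proof that the page is gone.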

10. Cost and access

Reproducing the corpus requires:

~$140-240 of OpenAI codex API budget (depending on top-N)
access to a working lore.kernel.org MCP server
a Linux kernel checkout

The first item is the hard barrier — without API access the data is read-only for downstream consumers. The corpus + scripts are the durable artifacts.

If any of these limitations matters for your use case, the methodology is structured to let you fix it: change one parameter, re-run, get a different snapshot. The discipline of keeping every fact cited and every invocation reproducible is what makes the corpus correctable rather than opaque.