Limitations
What this corpus is, and is not.
1. Granularity is the directory, not the driver file
The unit of analysis is the leaf directory under drivers/.
For directories that contain a single driver (e.g.
drivers/net/ethernet/fujitsu/ has only fmvj18x_cs.c), the
dossier accurately reflects that driver.
For directories containing multiple independent drivers, the dossier averages across them. Examples:
- drivers/net/ethernet/3com/ contains six drivers: 3c509, 3c515, 3c574_cs, 3c589_cs, 3c59x, typhoon. Andrew Lunn’s 2026-04 removal series targets four (3c509, 3c515, 3c574, 3c589) but leaves 3c59x and typhoon. Our dossier reports remove at 0.92 confidence: accurate to the direction of the campaign but not granular enough to distinguish per-file fates within the dir.
- drivers/net/ethernet/amd/ contains 14 drivers, of which only 2 (lance, nmclan_cs) are being removed in the same Lunn series. The dossier sees the dir as 86% active and reports keep-annotate: correct at the dir level, but the targeted removal of 2 specific drivers is not surfaced.
- drivers/net/ethernet/8390/ has 17 drivers; 4 are being removed (axnet_cs, pcnet_cs, smc-ultra, wd). Dir-level dossier is keep-annotate; per-file granularity would be 4× remove and 13× keep.
A future per-file pipeline (single dossier per .c) would catch
these. The current dir-level pipeline is a deliberate cost trade —
running per-file would multiply cost by ~5× without changing the
direction of most verdicts.
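The per-directory versus per-file gap in the three examples above comes down to simple counting. The following is a minimal sketch, not pipeline code: active_fraction is a hypothetical helper, only the driver counts come from the text, and the dossier may derive its own “86% active” figure from different inputs.

```python
def active_fraction(total_drivers: int, removed_drivers: int) -> float:
    """Share of drivers in a directory that are not targeted for removal."""
    if total_drivers <= 0:
        raise ValueError("total_drivers must be positive")
    if not 0 <= removed_drivers <= total_drivers:
        raise ValueError("removed_drivers must be between 0 and total_drivers")
    return (total_drivers - removed_drivers) / total_drivers


# Counts taken from the examples above: (total drivers, drivers being removed).
examples = {
    "drivers/net/ethernet/3com/": (6, 4),
    "drivers/net/ethernet/amd/": (14, 2),
    "drivers/net/ethernet/8390/": (17, 4),
}

for path, (total, removed) in examples.items():
    print(
        f"{path}: {active_fraction(total, removed):.0%} untouched, "
        f"{removed} remove / {total - removed} keep at file level"
    )
```

For amd/ this gives 12 of 14 drivers untouched, which rounds to the 86% quoted above, while the two removals stay invisible in a single directory-level verdict.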
2. Mega-subsystem leaves are excluded
phase1_rank.py::MEGA_SUBSYSTEM_PREFIXES blocks any leaf
directory inside:
- drivers/gpu/drm/{amd,i915,xe,nouveau}/
- drivers/net/wireless/{realtek/rtw88,realtek/rtw89,intel/iwlwifi,mediatek/mt76}/
- drivers/net/ethernet/{intel,mellanox,broadcom}/
- drivers/nvme/, drivers/scsi/megaraid/, drivers/md/dm
- drivers/staging/
Leaves under these prefixes are virtually all keep, so the codex cost would not pay
off. The blocklist is in source and editable. If you suspect a
deprecation candidate inside one of these subtrees (e.g.
drivers/gpu/drm/i915/display/dvo_ns2501 for an obscure DVO
chip), remove the prefix and re-run.
drivers/staging/ is excluded specifically because it has its
own deprecation/promotion process; producing dossier verdicts
there would conflict with upstream staging/ maintainers’
explicit decisions.
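A prefix blocklist of this kind can be sketched as follows. This is an assumption about the shape of the check, not a copy of phase1_rank.py: the tuple below is reconstructed from the prefixes listed above, and is_blocked is a hypothetical helper name.

```python
# Hypothetical reconstruction: the real list lives in phase1_rank.py and may differ.
MEGA_SUBSYSTEM_PREFIXES = (
    "drivers/gpu/drm/amd/",
    "drivers/gpu/drm/i915/",
    "drivers/gpu/drm/xe/",
    "drivers/gpu/drm/nouveau/",
    "drivers/net/wireless/realtek/rtw88/",
    "drivers/net/wireless/realtek/rtw89/",
    "drivers/net/wireless/intel/iwlwifi/",
    "drivers/net/wireless/mediatek/mt76/",
    "drivers/net/ethernet/intel/",
    "drivers/net/ethernet/mellanox/",
    "drivers/net/ethernet/broadcom/",
    "drivers/nvme/",
    "drivers/scsi/megaraid/",
    "drivers/md/dm",
    "drivers/staging/",
)


def is_blocked(leaf_dir: str, prefixes=MEGA_SUBSYSTEM_PREFIXES) -> bool:
    """True if leaf_dir is, or sits inside, one of the excluded subtrees."""
    # Normalise to a trailing slash so "drivers/nvme" matches "drivers/nvme/"
    # but "drivers/nvmem" does not.
    normalized = leaf_dir.rstrip("/") + "/"
    return normalized.startswith(tuple(prefixes))


# Dropping one prefix re-admits that subtree on the next run.
without_i915 = tuple(p for p in MEGA_SUBSYSTEM_PREFIXES if p != "drivers/gpu/drm/i915/")
print(is_blocked("drivers/gpu/drm/i915/display"))                # blocked by default
print(is_blocked("drivers/gpu/drm/i915/display", without_i915))  # admitted after the edit
```

Matching on a normalised trailing slash keeps sibling directories with a shared name stem from being excluded by accident.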
3. Snapshot, not a feed
The corpus is a one-shot snapshot. Each dossier records the
kernel HEAD SHA in its meta.json. Re-running picks up:
- new drivers added since the snapshot
- drivers whose status has changed (removal patches landing, new fixes appearing, syzbot reports, etc.)
- drivers removed entirely (the dossier is left in place; a follow-up sweep should reconcile)
Empirically, dormancy state changes slowly — a quarterly re-run catches most state changes. Daily/weekly re-runs are wasted budget.
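The follow-up sweep mentioned above amounts to a set comparison between the directories that have a dossier and the directories present in the current tree. A minimal sketch, assuming both are available as collections of path strings; Reconciliation and reconcile are hypothetical names, and the paths in the usage example are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reconciliation:
    added: frozenset     # in the tree now, no dossier yet
    removed: frozenset   # dossier exists, directory gone from the tree
    retained: frozenset  # present in both; status may still have changed


def reconcile(dossier_dirs, tree_dirs) -> Reconciliation:
    """Compare snapshot dossiers against the directories in a newer checkout."""
    dossiers = frozenset(dossier_dirs)
    tree = frozenset(tree_dirs)
    return Reconciliation(
        added=tree - dossiers,
        removed=dossiers - tree,
        retained=dossiers & tree,
    )


# Illustrative paths only.
result = reconcile(
    dossier_dirs=["drivers/example/old", "drivers/example/stable"],
    tree_dirs=["drivers/example/stable", "drivers/example/new"],
)
print(sorted(result.added), sorted(result.removed), sorted(result.retained))
```

Directories in retained still need a fresh dossier run to pick up status changes; the set comparison only detects arrivals and departures.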
4. Phase-1 dormancy is one signal, not a diagnosis
A directory at the top of the dormancy ranking is a “candidate
worth probing”, not a “deprecate me” signal. Many high-dormancy
directories turn out to be keep-annotate or even keep once
the phase-2 dossier runs:
- drivers/pinctrl/visconti: quiet because the Toshiba TMPV7708 SoC is a niche automotive part with stable hardware, but the driver works correctly and is supported by Toshiba
- drivers/soc/lantiq: quiet because Lantiq xDSL chipsets are long-stabilised, but OpenWrt 24.10.2 still ships images for the platform
The model’s job in phase 2 is exactly to discriminate these from genuine deprecation candidates.
5. Lore search is fragile
The lore-http MCP server’s lore_search (BM25 fused with
trigram) returns errors ~40% of the time on the current corpus
because of a known infrastructure issue (the BM25 index
generation lags behind the corpus). The model handles this
gracefully by falling back to lore_regex or lore_file_timeline,
but in a small fraction of cases it gives up on lore evidence
and relies on web search alone, which weakens the dossier.
This is an upstream-tool-health issue, not a methodology issue. Once BM25 is rebuilt, dossier quality should marginally improve on a re-run.
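The fallback behaviour described above can be sketched as an ordered chain of tools where the first call that does not error wins. This is illustrative only: the callables below are stand-ins rather than the real MCP client API, and LoreToolError and first_working are hypothetical names.

```python
from typing import Callable, Iterable, Optional, Tuple


class LoreToolError(Exception):
    """Stand-in for whatever error a failing lore tool call raises."""


def first_working(
    tools: Iterable[Tuple[str, Callable[[str], list]]],
    query: str,
) -> Tuple[Optional[str], list]:
    """Try each (name, tool) in order; return the first result that does not error."""
    for name, tool in tools:
        try:
            results = tool(query)
        except LoreToolError:
            continue  # this tool is unhealthy, move on to the next one
        return name, results
    return None, []  # every lore tool failed; the caller is left with web search only


# Fake tools that mimic a broken BM25 search and a working regex search.
def fake_lore_search(query: str) -> list:
    raise LoreToolError("BM25 generation behind corpus")


def fake_lore_regex(query: str) -> list:
    return [f"thread mentioning {query}"]


print(first_working(
    [("lore_search", fake_lore_search), ("lore_regex", fake_lore_regex)],
    "3c509",
))
```

Returning the name of the tool that answered makes it possible to record in the dossier which evidence path was actually used.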
6. Web search recall is uneven
The model’s web search works well for:
- vendor product pages (Marvell, Broadcom, IBM, etc.)
- distro package indexes (cateee LKDDb, Debian / Fedora pkgs)
- canonical references (Wikipedia, kernel docs)
- virtualisation guest docs (QEMU, VMware, VirtualBox)
- hobbyist communities (OpenWrt, postmarketOS, Maemo wikis)
It is weaker for:
- non-English vendor sites (often not indexed)
- forum threads and mailing-list archives outside lore.kernel.org
- archived / historical pages that have been moved without redirects
- regional retail sites (would tell us whether hardware is still sold in a specific country)
Where web search is weak, dossier confidence drops naturally — the model’s prompt forces low confidence when sources are thin.
7. The recommendation is not a patch
A remove dossier means: there is enough evidence to take the
question seriously. It is not a sanctioned upstream patch.
Going from a remove dossier to a sent patch requires:
- per-driver maintainer review
- a Cc: stable@vger.kernel.org discussion if any backports matter
- adherence to the project’s disclosure / cleanup conventions
The corpus is seed material for those conversations, not a substitute for them.
8. The model can be wrong
Spot-checks have found zero overturned verdicts so far across ~5 hand-verified deprecate/remove cases, but this is a small sample. At larger scale, expect:
- the occasional false positive deprecate on a driver that’s quietly fine (e.g., a stable industrial driver where the vendor’s online presence has shrunk but deployments are real)
- the occasional false negative keep on a driver that has silently rotted (no obvious removal patch yet, but no real users either)
Treat dossiers with confidence < 0.7 as candidates for human
review rather than automation inputs.
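A consumer of the corpus might apply that threshold as a simple triage step. The sketch below assumes each dossier is available as a dict with path, verdict, and confidence keys; that schema and the triage helper are assumptions, and the second entry is invented for illustration.

```python
REVIEW_THRESHOLD = 0.7  # the cut-off suggested above


def triage(dossiers, threshold: float = REVIEW_THRESHOLD):
    """Split dossiers into (needs_human_review, usable_as_automation_input)."""
    needs_review, automatable = [], []
    for dossier in dossiers:
        if dossier["confidence"] < threshold:
            needs_review.append(dossier)
        else:
            automatable.append(dossier)
    return needs_review, automatable


dossiers = [
    # Verdict and confidence as reported for 3com/ earlier in this section.
    {"path": "drivers/net/ethernet/3com/", "verdict": "remove", "confidence": 0.92},
    # Hypothetical low-confidence entry.
    {"path": "drivers/example/thin-sources/", "verdict": "deprecate", "confidence": 0.55},
]

needs_review, automatable = triage(dossiers)
print([d["path"] for d in needs_review])
print([d["path"] for d in automatable])
```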
9. URL freshness
URLs cited in dossiers are timestamped at the moment of probing.
A vendor EOL page or distro config link can disappear or move
later. The validator script (spot_check.py) HEAD-resolves URLs
on demand, but the corpus itself doesn’t track per-URL
expiration. For a public-facing site, periodic re-validation is
worth scheduling.
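A periodic re-validation pass could look roughly like the following. This is a sketch using only the Python standard library, not the actual spot_check.py: head_status and stale_urls are hypothetical names, and the probe is injectable so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def head_status(url: str, timeout: float = 10.0):
    """Return the HTTP status of a HEAD request, or None if the request could not be made."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status such as 404
    except (OSError, ValueError):
        return None  # DNS failure, timeout, refused connection, or malformed URL


def stale_urls(urls, probe=head_status):
    """URLs whose probe did not come back with a 2xx or 3xx status."""
    stale = []
    for url in urls:
        status = probe(url)
        if status is None or not 200 <= status < 400:
            stale.append(url)
    return stale


# Offline demonstration with a canned probe; the URLs are placeholders.
canned = {
    "https://vendor.example/eol-notice": 200,
    "https://vendor.example/moved-page": 404,
    "https://gone.example/config": None,
}
print(stale_urls(canned, probe=canned.get))
```

Some servers reject HEAD requests outright, so a production check may need a GET fallback before declaring a URL dead.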
10. Cost and access
Reproducing the corpus requires:
- ~$140-240 of OpenAI codex API budget (depending on top-N)
- access to a working lore.kernel.org MCP server
- a Linux kernel checkout
The first item is the hard barrier — without API access the data is read-only for downstream consumers. The corpus + scripts are the durable artifacts.
If any of these limitations matters for your use case, the methodology is structured to let you fix it: change one parameter, re-run, get a different snapshot. The discipline of keeping every fact cited and every invocation reproducible is what makes the corpus correctable rather than opaque.