Dossier schema reference
Every data/dossiers/<driver-path>/dossier.json validates against
data/schema.v1.json. The schema is a closed Draft-2020-12
JSON Schema:
- every property is in required
- every level has additionalProperties: false
- every enum is closed
This makes the corpus directly usable as an Astro content collection (mirror the schema in Zod) or any other typed data pipeline.
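The first two closure rules can be checked mechanically. The following is a minimal sketch, assuming the schema has been loaded into a plain dict; the function name find_open_spots and the two tiny inline schemas are hypothetical and not part of the repository. Enums are closed by definition in JSON Schema, so the sketch does not check them.

```python
def find_open_spots(schema, path="#"):
    """Return descriptions of object subschemas that break the closure rules.

    Caveat: a property literally named "properties" would be misread as a
    subschema keyword; the sketch ignores that edge case.
    """
    problems = []
    if isinstance(schema, dict):
        if "properties" in schema:
            props = set(schema["properties"])
            # Rule 1: every declared property must also be listed in required.
            if set(schema.get("required", [])) != props:
                problems.append(f"{path}: not every property is required")
            # Rule 2: additionalProperties must be exactly false.
            if schema.get("additionalProperties") is not False:
                problems.append(f"{path}: additionalProperties is not false")
        # Recurse into every nested value (properties, items, and so on).
        for key, value in schema.items():
            problems.extend(find_open_spots(value, f"{path}/{key}"))
    elif isinstance(schema, list):
        for index, value in enumerate(schema):
            problems.extend(find_open_spots(value, f"{path}/{index}"))
    return problems


# Hypothetical schemas for illustration only.
closed = {
    "type": "object",
    "required": ["url", "claim"],
    "additionalProperties": False,
    "properties": {"url": {"type": "string"}, "claim": {"type": "string"}},
}
not_closed = {"type": "object", "properties": {"url": {"type": "string"}}}

print(find_open_spots(closed))
print(find_open_spots(not_closed))
```

An empty result means every object level in the schema is closed in the sense described above.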
Top-level fields
{
  "driver_path": "drivers/net/wireless/ath/ar5523",
  "chipset_family": "Atheros AR5523 802.11abg USB",
  "hardware_still_sold_new_in_2025": false,
  "last_widely_available_year": 2010,
  "deployments_today": "low",
  "replacement_driver": null,
  "recommendation_hint": "deprecate",
  "confidence": 0.79,
  "sources": [{"url": "...", "claim": "..."}],
  "reasoning_notes": "Long prose explaining the verdict + provenance of each cited URL."
}
Field-by-field
driver_path — string
Path within the Linux source tree, e.g.
drivers/net/wireless/ath/ar5523. Must match the directory
the dossier lives in (the validator checks this).
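The directory check can be pictured as a plain path comparison. This is a sketch under the assumption that dossiers live at data/dossiers/<driver-path>/dossier.json as stated at the top of this page; the helper name driver_path_matches is hypothetical, and the real validator may implement the check differently.

```python
from pathlib import PurePosixPath


def driver_path_matches(dossier_file, driver_path, root="data/dossiers"):
    """True if dossier_file is exactly <root>/<driver_path>/dossier.json."""
    expected = PurePosixPath(root) / driver_path / "dossier.json"
    return PurePosixPath(dossier_file) == expected


dossier_file = "data/dossiers/drivers/net/wireless/ath/ar5523/dossier.json"
print(driver_path_matches(dossier_file, "drivers/net/wireless/ath/ar5523"))
print(driver_path_matches(dossier_file, "drivers/net/wireless/ath/ath9k"))
```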
chipset_family — string
Short chipset family name as referred to in vendor docs. Free
text. May be empty for not-a-driver entries. Examples:
“Atheros AR5523 802.11abg USB”, “Cortina/StorLink Gemini SoC”,
“3Com Vortex/Boomerang/Cyclone PCI Ethernet”.
hardware_still_sold_new_in_2025 — boolean
Best evidence-based answer to “is this hardware sold new today?” The model is asked to ground this in vendor pages, distro package indexes, retail searches, etc. — not training-data recall alone.
last_widely_available_year — integer | null
The year the hardware was last widely available at retail. Range is 1990-2026. Null if the model could not pin down a date with confidence; the prompt prefers null over guessing.
deployments_today — enum
One of none | low | medium | high | unknown. The qualitative
answer to “is anyone running this today?”, combining virtualization
guest support, embedded use, distro defaults, and hobbyist
communities.
- none: no evidence of any current deployment
- low: niche / hobbyist / single-vendor industrial, observed but rare
- medium: real ongoing use in a non-trivial population
- high: broadly deployed
- unknown: model could not find evidence
replacement_driver — string | null
The upstream driver(s) that cover the same use case today. Null if no clean replacement exists. Examples:
- ibmveth (replaces IBM eHEA)
- e1000e (replaces classic 3Com / DEC PCI NICs)
- ath9k_htc (replaces ar5523 USB Atheros)
- null (no equivalent; driver is genuinely orphaned)
recommendation_hint — enum
One of keep | keep-annotate | deprecate | remove | unsure | not-a-driver. The headline output of the dossier.
- keep: actively maintained, broadly deployed, no action needed
- keep-annotate: mostly inactive but plausible niche use; document the niche rather than remove
- deprecate: strong evidence the hardware is gone; candidate for the next removal series; no in-flight patch yet
- remove: an upstream removal patch is already in flight for this driver; the dossier exists to surface that fact and back the patch with evidence
- unsure: confidence < 0.5; the model declined to commit
- not-a-driver: the directory is content (routing tables, test fixtures, header-only) rather than a driver. Phase-1 catches most of these; this is the model’s safety net for ones that slipped through.
confidence — number
Range [0, 1]. Self-reported by the model. Calibration:
- < 0.3 → almost always unsure or not-a-driver early-exit
- 0.3-0.6 → cautious recommendation, often keep-annotate
- 0.6-0.8 → defensible recommendation backed by 3-5 cited URLs
- > 0.8 → strong recommendation; multiple corroborating sources; often a lore.kernel.org cite to active upstream activity
- 0.9+ → reserved for remove verdicts where an in-flight removal patch is cited
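For consumers that want to bucket dossiers by confidence, the calibration bands can be turned into a small lookup. This is only a sketch: the band labels are hypothetical, and because the listed ranges share their endpoints, the code assumes that 0.3 and 0.6 fall into the higher band while 0.8 stays in the 0.6-0.8 band (the top band is strictly greater than 0.8). The 0.9+ tier is not modelled separately because it also depends on the verdict.

```python
def confidence_band(confidence):
    """Map a self-reported confidence in [0, 1] to a calibration band label."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence!r}")
    if confidence < 0.3:
        return "early-exit"   # almost always unsure or not-a-driver
    if confidence < 0.6:
        return "cautious"     # often keep-annotate
    if confidence <= 0.8:
        return "defensible"   # backed by 3-5 cited URLs
    return "strong"           # multiple corroborating sources


# The ar5523 example at the top of this page reports 0.79.
print(confidence_band(0.79))  # defensible
```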
sources — array
Zero or more {url, claim} objects. The prompt requires every
non-trivial fact to be cited. Empty array is allowed when the
model has no evidence and confidence is correspondingly low.
{
  "url": "https://lore.kernel.org/netdev/20260422-...-lunn.ch/",
  "claim": "April 22, 2026 removal patch series proposes deleting fmvj18x_cs.c (1213 lines)."
}
Cited URLs are checked after the fact by scripts/spot_check.py,
which issues a HEAD request to every URL and reports anomalies.
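The offline half of such a check can be sketched with the standard library. The function name is_well_formed and the rule it applies (http or https scheme plus a non-empty host) are assumptions for illustration; what scripts/spot_check.py actually treats as an anomaly is not specified here, and the network HEAD request is deliberately left out.

```python
from urllib.parse import urlsplit


def is_well_formed(url):
    """Offline check only: http(s) scheme and a non-empty host name."""
    try:
        parts = urlsplit(url)
        return parts.scheme in ("http", "https") and bool(parts.hostname)
    except ValueError:
        # urlsplit rejects some inputs outright, e.g. unbalanced IPv6 brackets.
        return False


print(is_well_formed("https://lore.kernel.org/netdev/"))
print(is_well_formed("..."))
```

A placeholder such as "..." fails the check because it has neither a scheme nor a host.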
reasoning_notes — string
Free text. The model uses this to:
- explain the verdict in 1-2 sentences
- name the provenance of each cited URL (“the lore URL came from lore_file_timeline; the LKDDb URL was canonical recall; the vendor EOL URL was found via web_search”)
- record any caveats or uncertainty
The prompt explicitly asks for provenance attribution, which gives auditors a single place to spot-check tool-vs-recall claims.
Schema enforcement
Two layers of validation:
- At inference time: codex passes --output-schema to the Responses API, which enforces the schema server-side. The model cannot emit malformed JSON or break enums; the API refuses the response.
- Post-hoc: scripts/validate_dossiers.py re-validates every dossier with jsonschema.Draft202012Validator, plus structural checks: required files present, driver_path matches directory layout, URLs well-formed, meta.json carries token counts.
On the current corpus, post-hoc validation reports zero issues across 864 dossiers and 3,552 cited URLs.
Mapping to Zod (Astro)
import { z, defineCollection } from "astro:content";

const driver = defineCollection({
  type: "data",
  schema: z.object({
    driver_path: z.string(),
    chipset_family: z.string(),
    hardware_still_sold_new_in_2025: z.boolean(),
    last_widely_available_year: z.number().int().min(1990).max(2026).nullable(),
    deployments_today: z.enum(["none", "low", "medium", "high", "unknown"]),
    replacement_driver: z.string().nullable(),
    recommendation_hint: z.enum([
      "keep", "keep-annotate", "deprecate",
      "remove", "unsure", "not-a-driver",
    ]),
    confidence: z.number().min(0).max(1),
    sources: z.array(z.object({
      url: z.string().url(),
      claim: z.string(),
    })),
    reasoning_notes: z.string(),
  }).strict(),
});

export const collections = { driver };
Sibling files
Beyond dossier.json, every per-driver dir contains:
- summary.json: derived index entry, {driver_path, valid_json, recommendation_hint, confidence, sources, tool_counts, tool_log}
- meta.json: invocation receipt, {driver_path, kernel_root, kernel_head_sha, phase1_since, codex_cmd, elapsed_s, exit_code, in_tokens, cached_input_tokens, out_tokens, generated_at}
- static_features.json: phase-1 git-log facts; see ranking.md for the field list
- prompt.md: the prompt sent to codex (auditable)
- events.jsonl: full codex event stream
- stderr.log: codex stderr
For the website, the dossier is the primary content. The other files are useful for an “audit trail” view per driver: show what was asked, what tools fired, how long it took, how many tokens it cost, against which kernel SHA.
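As a rough illustration of such an audit-trail view, the sketch below condenses a parsed meta.json into one line. It uses only field names listed above, but the field types (a numeric elapsed_s, a string kernel_head_sha) are assumptions, the helper name audit_summary is hypothetical, and the sample values are made-up placeholders rather than real corpus data.

```python
def audit_summary(meta):
    """Build a one-line audit-trail summary from a parsed meta.json dict."""
    return (
        f"{meta['driver_path']}: {meta['elapsed_s']}s, "
        f"{meta['in_tokens']} in / {meta['out_tokens']} out tokens, "
        f"kernel {meta['kernel_head_sha'][:12]}, exit {meta['exit_code']}"
    )


# Placeholder values for illustration only.
sample_meta = {
    "driver_path": "drivers/net/wireless/ath/ar5523",
    "elapsed_s": 12.5,
    "in_tokens": 1000,
    "out_tokens": 200,
    "kernel_head_sha": "0123456789abcdef0123",
    "exit_code": 0,
}
print(audit_summary(sample_meta))
```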
Schema versioning
schema.v1.json is v1. If a future revision changes shape:
- bump to schema.v2.json
- keep v1 around for old dossiers
- the validator should pick the schema version from the dossier (we don’t currently embed it; consider adding a _schema field if v2 lands)
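Version selection could look roughly like the sketch below. It assumes a future _schema field would hold an integer version and that dossiers without it are v1; neither is decided here, and the helper name schema_file_for is hypothetical. Because v1 sets additionalProperties: false, only v2-or-later dossiers could carry the field.

```python
def schema_file_for(dossier, default_version=1):
    """Pick the schema file for a dossier dict, defaulting to v1."""
    version = dossier.get("_schema", default_version)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(version, bool) or not isinstance(version, int) or version < 1:
        raise ValueError(f"unsupported _schema value: {version!r}")
    return f"data/schema.v{version}.json"


print(schema_file_for({"driver_path": "drivers/net/wireless/ath/ar5523"}))
print(schema_file_for({"_schema": 2}))
```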
The current schema is intentionally simple. Future fields to consider:
- cve_count_5y: number of CVEs touching this driver in 5 years
- distro_enablement[]: structured per-distro Y/M/N
- audit_trail: names of tools used by the model in producing the dossier (currently free-text in reasoning_notes)