Reproduce

Concrete recipe for re-running every phase of the activity-tracking pipeline against a fresh kernel snapshot, a different model, or a different prompt. Each phase is idempotent and independently re-runnable.

Prereqs

  • A Linux kernel checkout, any branch or SHA you want to snapshot against. Set KERNEL_ROOT=/path/to/your/linux for the commands below.
  • The OpenAI codex CLI, with a working mcp_servers.lore-http entry in ~/.codex/config.toml. The probe uses lore.kernel.org MCP tools (lore_activity, lore_file_timeline, lore_search, lore_regex) plus the model’s native web_search.
  • uv — the Python package manager used by every script in scripts/. The scripts are PEP-723 inline-script style; uv run --script foo.py resolves dependencies and runs the script in one step.
  • About $200 of OpenAI API budget for a full top-1000 sweep at gpt-5.4 / model_reasoning_effort="medium". Per-probe cost is ~$0.20–0.30; ~78 % of input tokens are served from cache across a batch.
  • (Optional) a lei mirror of lore.kernel.org for offline lore queries. Not required — the MCP server will fetch live if lei isn’t reachable.
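A quick preflight check can catch a missing tool before any phase starts. The sketch below is not part of the repository; the function name missing_prereqs is hypothetical, and it assumes the CLIs are installed under the binary names uv and codex used in the commands on this page.

```python
import shutil
from pathlib import Path


def missing_prereqs(kernel_root, tools=("uv", "codex")):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    root = Path(kernel_root)
    # A kernel checkout is expected to be a git work tree with a drivers/ dir.
    if not (root / ".git").exists():
        problems.append(f"{root} is not a git checkout")
    if not (root / "drivers").is_dir():
        problems.append(f"{root} has no drivers/ directory")
    # Assumption: the tools are on PATH under these binary names.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems
```

Calling missing_prereqs(os.environ["KERNEL_ROOT"]) and printing the result is enough for a one-off check; it does not verify the mcp_servers.lore-http entry or API credentials.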

Phase 1 — dormancy ranker (no LLM)

uv run --script scripts/phase1_rank.py \
    --kernel-root "$KERNEL_ROOT" \
    --since 2021-04-24 \
    --top-n 1000 \
    --out data/phase1-ranking.json \
    --shortlist data/phase1-shortlist.txt

Walks every leaf directory under drivers/, computes the deterministic features (commit counts, age, last-substantive-touch, top author, dormancy score), and emits two artifacts:

  • data/phase1-ranking.json — the full ranking of all ~2,000 leaf dirs (including the ones forced to score 0 by the mega-subsystem and no-driver-marker filters).
  • data/phase1-shortlist.txt — the top-N (default 1000) candidates with parent-subsumption applied.

Wall time: ~3 minutes on a kernel checkout with five years of history. No API spend.
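For orientation, the directory walk can be sketched as follows. This is an illustration only: it assumes "leaf" means a directory with no subdirectories, and the name leaf_dirs is hypothetical. The real phase1_rank.py also applies the filters and scoring described above, which this sketch omits.

```python
import os


def leaf_dirs(kernel_root, subtree="drivers"):
    """Return sorted kernel-relative paths of directories with no subdirectories."""
    leaves = []
    top = os.path.join(kernel_root, subtree)
    for dirpath, dirnames, _filenames in os.walk(top):
        # Assumption: a leaf is any directory that has no child directories.
        if not dirnames:
            leaves.append(os.path.relpath(dirpath, kernel_root))
    return sorted(leaves)
```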

See ranking for the dormancy formula and the mega-subsystem blocklist.

Phase 2 — codex dossier probe

uv run --script scripts/phase2_probe.py \
    --shortlist data/phase1-shortlist.txt \
    --kernel-root "$KERNEL_ROOT" \
    --schema data/schema.v1.json \
    --out-dir data/dossiers

For each shortlisted directory, runs one codex exec call producing a strict-schema dossier. The actual command, frozen on disk in every data/dossiers/<path>/meta.json, is:

codex exec \
    --ephemeral --ignore-rules --skip-git-repo-check \
    -c model="gpt-5.4" \
    -c model_reasoning_effort="medium" \
    -s workspace-write \
    -C "$KERNEL_ROOT" \
    --add-dir data/dossiers/<path> \
    --add-dir "/run/user/$(id -u)" \
    --output-schema data/schema.v1.json \
    -o data/dossiers/<path>/dossier.json \
    --json \
    "<prompt>"

The probe is idempotent: it skips any directory whose dossier.json already exists. To re-probe a subset, pass --force --include drivers/foo,drivers/bar. To re-probe everything, pass --force alone.
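The skip logic described above can be sketched as below. This is a minimal illustration, not the code in phase2_probe.py: the function name select_targets is hypothetical, and it assumes each dossier lives at <out-dir>/<driver path>/dossier.json and that --include narrows the candidate set whether or not --force is given.

```python
from pathlib import Path


def select_targets(shortlist, out_dir, force=False, include=None):
    """Pick the shortlisted directories that still need a probe."""
    out_dir = Path(out_dir)
    targets = []
    for driver in shortlist:
        # Assumption: include restricts the run to the named paths.
        if include is not None and driver not in include:
            continue
        already_done = (out_dir / driver / "dossier.json").exists()
        # Without force, an existing dossier.json means "skip".
        if already_done and not force:
            continue
        targets.append(driver)
    return targets
```

Because the decision depends only on which dossier.json files exist, an interrupted run can be restarted with the same command and will pick up where it stopped.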

Per-probe stats (recorded in meta.json):

  • ~75 seconds wall clock
  • ~170k input tokens (~78 % cached after the first few probes in a batch)
  • ~3k output tokens
  • ~$0.20–0.30 cost at current gpt-5.4 pricing

Sequential probing of all 864 dirs: ~18 hours wall, ~$200. To go faster, parallelize with xargs -P, staying within your account’s per-minute request and token limits (RPM/TPM).
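The sweep totals follow directly from the per-probe figures, which makes it easy to re-estimate for a different shortlist size. A small sketch (the function name sweep_estimate is hypothetical; the workers parameter assumes ideal linear speedup, which rate limits and cache behavior may not deliver):

```python
def sweep_estimate(n_dirs, secs_per_probe=75, cost_low=0.20, cost_high=0.30, workers=1):
    """Return (wall hours, low cost, high cost) for probing n_dirs directories."""
    # Idealized: wall time divides evenly across workers; cost does not change.
    hours = n_dirs * secs_per_probe / 3600 / workers
    return hours, n_dirs * cost_low, n_dirs * cost_high


if __name__ == "__main__":
    hours, low, high = sweep_estimate(864)
    print(f"{hours:.1f} h, ${low:.0f}-${high:.0f}")  # prints "18.0 h, $173-$259"
```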

See pipeline for the prompt design and tool budget.

Phase 3 — validate

uv run --script scripts/validate_dossiers.py
uv run --script scripts/spot_check.py data/dossiers

validate_dossiers.py is the structural check: every dossier dir has the expected files, dossier.json validates against data/schema.v1.json, and driver_path matches the directory layout. Across 864 dossiers, it reports zero issues.
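The layout part of that check can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not the code in validate_dossiers.py: it assumes driver_path is a top-level key in dossier.json and that the dossier directory relative to data/dossiers equals that path. The function name layout_mismatches is hypothetical, and schema validation is left out.

```python
import json
from pathlib import Path


def layout_mismatches(dossier_root):
    """Return (directory, driver_path) pairs where the two disagree."""
    root = Path(dossier_root)
    mismatches = []
    for dossier in sorted(root.rglob("dossier.json")):
        expected = dossier.parent.relative_to(root).as_posix()
        try:
            data = json.loads(dossier.read_text())
        except json.JSONDecodeError:
            data = None
        # Assumption: driver_path is a top-level string field.
        actual = data.get("driver_path") if isinstance(data, dict) else None
        if actual != expected:
            mismatches.append((expected, actual))
    return mismatches
```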

spot_check.py issues HEAD requests (via curl -L, so redirects are followed) against every cited URL in parallel and classifies each as 2xx / 3xx / 4xx / 5xx / blocked. Bot-blocked URLs (Anubis on lore, Cloudflare on some vendor sites) return 403/429; those pages exist but are gated. A genuinely fabricated URL would show up as a 404.
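The classification rule reads naturally as a small function. This is a sketch of the rule as described here, not the code in spot_check.py; the name classify and the extra "other" bucket (for anything outside 200–599, such as a failed connection reported as 0) are assumptions.

```python
from collections import Counter

# Status codes treated as bot-gating rather than as a missing page.
BLOCKED = {403, 429}


def classify(status):
    """Map an HTTP status code to 2xx / 3xx / 4xx / 5xx / blocked."""
    if status in BLOCKED:
        return "blocked"
    if 200 <= status < 600:
        return f"{status // 100}xx"
    return "other"  # assumption: bucket for non-HTTP results


def summarize(statuses):
    """Count how many URLs fall into each class."""
    return Counter(classify(s) for s in statuses)
```

With this split, a 404 lands in 4xx while 403 and 429 do not, so the 4xx count is the one to inspect for fabricated citations.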

Refresh the snapshot

To bump to a newer kernel:

  1. cd "$KERNEL_ROOT" && git pull
  2. Re-run Phase 1 (3 min, no cost).
  3. Re-run Phase 2 — incremental by default; only newly-shortlisted dirs get probed. Add --force if you want to re-probe drivers whose dossiers might be stale.
  4. Re-run Phase 3.
  5. Rebuild the site: cd site && npm run build. The site reads the corpus through the symlinks in site/src/content/driver and site/src/data/, so no code change is needed.

Change the model or the prompt

The model and reasoning effort are set in scripts/phase2_probe.py:

"-c", 'model="gpt-5.4"',
"-c", 'model_reasoning_effort="medium"',

The prompt is built in build_prompt() in the same file. Edit either, then re-probe with --force on the affected paths. Each run preserves its own model, model_reasoning_effort, and the full codex_cmd argv in meta.json, so a single corpus can mix versions and remain auditable.
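Since every meta.json records the model and effort, a mixed corpus can be audited with a short script. The sketch below assumes model and model_reasoning_effort are top-level keys in meta.json; the function name model_mix is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def model_mix(dossier_root):
    """Count dossiers per (model, model_reasoning_effort) pair."""
    counts = Counter()
    for meta_path in Path(dossier_root).rglob("meta.json"):
        meta = json.loads(meta_path.read_text())
        # Assumption: both values are stored as top-level keys.
        counts[(meta.get("model"), meta.get("model_reasoning_effort"))] += 1
    return counts
```

Running model_mix("data/dossiers") after a partial re-probe shows at a glance how many dossiers still come from the previous configuration.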

Bulk export

The site publishes the corpus as flat machine-readable downloads for anyone who wants to crunch it directly without scraping the HTML pages:

  • /data/dossiers.json — all 864 LLM dossiers as a single JSON array, full schema (chipset, verdict, confidence, sources, reasoning_notes).
  • /data/dossiers.csv — same, flattened to CSV. Sources are joined with ; in a single column; reasoning_notes is preserved as a quoted cell.
  • /data/registry.json — all 2,028 leaf driver directories as a single array, with a kind field (dossier or stub) and the deterministic Phase 1 features for every entry.
  • /data/registry.csv — same, CSV-flat.

All four files are regenerated on every site build, so they always match the current snapshot. License: CC-BY-4.0 (data). Cite this project and the kernel SHA in data/index.json when you use it.
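When working from the CSV export, the joined sources column needs to be split back into a list. A minimal sketch, assuming the column is named sources (after the schema field) and that the CSV has a header row; the function name read_dossier_rows is hypothetical. The csv module handles the quoted reasoning_notes cell on its own.

```python
import csv
import io


def read_dossier_rows(csv_text):
    """Parse dossiers.csv text and split the ;-joined sources into a list."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        joined = row.get("sources") or ""
        # Sources are joined with ";"; strip padding and drop empty pieces.
        row["sources"] = [s.strip() for s in joined.split(";") if s.strip()]
        rows.append(row)
    return rows
```

For anything beyond quick tabulation, /data/dossiers.json avoids the join entirely because it keeps sources as part of the full schema.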