Phase 1: dormancy ranking

The phase-1 ranker is the cost-control front-end of the pipeline. It decides which directories are worth spending codex calls on.

It lives in scripts/phase1_rank.py, takes about three minutes on a kernel checkout with five years of history, and emits a ranked JSON file covering every leaf driver directory, plus a top-N shortlist.

What gets ranked

The unit is a leaf directory under drivers/ containing at least one .c file. There are roughly 2,050 such directories in current Linux mainline.

A directory is excluded from ranking entirely if any of these hold:

| filter | location in phase1_rank.py | rationale |
|---|---|---|
| asset-suffix path | NON_DRIVER_SUFFIXES | /tests, /include, /dt, /docs, /examples, /uapi, /fixtures, etc.: content directories that are not drivers |
| no driver-entry-point macro | DRIVER_MARKER_RE | the dir’s .c files contain no module_init, module_<bus>_driver(), platform_driver_register(), pci_register_driver(), etc. |
| mega-subsystem prefix | MEGA_SUBSYSTEM_PREFIXES | matches an explicit blocklist (see below) |
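
As a rough illustration of the first row, the asset-suffix check could look like the sketch below. The tuple holds only the suffixes named in the table (the real list continues past "etc."), and the helper name has_asset_suffix is hypothetical; the actual code in phase1_rank.py may be organised differently.

```python
# Illustrative subset: only the suffixes named in the table above.
NON_DRIVER_SUFFIXES = (
    "/tests", "/include", "/dt", "/docs", "/examples", "/uapi", "/fixtures",
)


def has_asset_suffix(path: str) -> bool:
    """Return True if the directory path ends in one of the asset suffixes."""
    # str.endswith accepts a tuple and tests each suffix in turn.
    return path.rstrip("/").endswith(NON_DRIVER_SUFFIXES)


print(has_asset_suffix("drivers/comedi/drivers/tests"))
print(has_asset_suffix("drivers/net/ethernet/silan"))
```

Matching on the full "/name" suffix rather than the bare name keeps a directory such as a hypothetical drivers/foo/includes from being excluded by accident.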

After filtering, the ranker emits dormancy scores for every remaining directory. The top-N shortlist is built from this with parent-subsumption applied: if a directory is in the shortlist, none of its descendants enter the shortlist separately.

Per-directory features

| feature | derivation |
|---|---|
| commits_5y | every non-merge commit in the last 5 years that touched a file directly inside this dir (not subdirs) |
| substantive_commits_5y | commits_5y minus mechanical sweeps (subject regex blocklist, bot author blocklist, file-count cap of 50 files per commit) |
| mechanical_commits_5y | commits_5y minus substantive_commits_5y |
| first_touch_ts | earliest commit (across all history) touching any file in this dir |
| last_touch_ts | most recent commit (within 5y window) |
| last_substantive_touch_ts | most recent substantive commit |
| unique_authors_5y | distinct authors across substantive commits |
| top_author, top_author_commits | most prolific author and their share |

The “substantive” filter is the heart of the methodology. A directory with commits_5y=15 and substantive_commits_5y=2 is much more dormant than the raw count suggests: most of the touches are tree-wide cleanups, not real maintenance. Without this filter the ranker would systematically underrate the dormancy of quiet drivers.
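
The feature set can be pictured as a small record type. The sketch below is an assumption about shape only: the document names DirFeatures and the fields, but the types, defaults, and the choice to derive mechanical_commits_5y as a property are illustrative, and drivers/example is a made-up path.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirFeatures:
    """Per-directory features; field types here are assumptions."""
    path: str
    commits_5y: int = 0
    substantive_commits_5y: int = 0
    first_touch_ts: Optional[int] = None             # Unix seconds
    last_touch_ts: Optional[int] = None              # Unix seconds
    last_substantive_touch_ts: Optional[int] = None  # Unix seconds
    unique_authors_5y: int = 0
    top_author: Optional[str] = None
    top_author_commits: int = 0

    @property
    def mechanical_commits_5y(self) -> int:
        # Derived value: all commits minus the substantive ones.
        return self.commits_5y - self.substantive_commits_5y


# The example from the text: 15 raw commits, only 2 substantive.
example = DirFeatures(path="drivers/example", commits_5y=15, substantive_commits_5y=2)
print(example.mechanical_commits_5y)  # 13
```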

Mechanical-commit filter

A commit counts as mechanical if any of:

  • subject matches one of MECHANICAL_SUBJECT_PATTERNS: ^treewide:, ^checkpatch:, ^coccinelle:, ^cocci:, ^spdx:, ^dt-bindings:, ^maintainers:, ^license:, ^kbuild:, ^kconfig:, ^fix spelling, ^fix typo, ^scripts/, ^automatic, ^documentation:, etc.
  • author matches BOT_AUTHOR_SUBSTRINGS: kernel test robot, syzbot, Stephen Rothwell (linux-next mechanical), etc.
  • the commit touched ≥ TREEWIDE_FILE_THRESHOLD (50) files

The blocklist is conservative: it drops obvious sweeps and lets through anything ambiguous. If a real bug fix is occasionally dropped because its subject starts with Kconfig:, the cost is small. That commit would have nudged the verdict toward keep-annotate rather than deprecate, and keep-annotate is already the safe default.
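
A minimal sketch of the three-way test follows. The pattern and author lists are subsets of those named above, the function name is_mechanical is hypothetical, and case-insensitive matching is an assumption (suggested by the Kconfig: example, but not confirmed by the source).

```python
import re

# Subsets of the lists named above; the real lists are longer.
MECHANICAL_SUBJECT_PATTERNS = [
    r"^treewide:", r"^checkpatch:", r"^coccinelle:", r"^cocci:",
    r"^spdx:", r"^kconfig:", r"^fix spelling", r"^fix typo",
]
BOT_AUTHOR_SUBSTRINGS = ["kernel test robot", "syzbot", "stephen rothwell"]
TREEWIDE_FILE_THRESHOLD = 50

# Assumption: subjects are matched case-insensitively.
_SUBJECT_RE = re.compile("|".join(MECHANICAL_SUBJECT_PATTERNS), re.IGNORECASE)


def is_mechanical(subject: str, author: str, files_touched: int) -> bool:
    """A commit is mechanical if any one of the three criteria holds."""
    if _SUBJECT_RE.search(subject):
        return True
    author_lc = author.lower()
    if any(bot in author_lc for bot in BOT_AUTHOR_SUBSTRINGS):
        return True
    return files_touched >= TREEWIDE_FILE_THRESHOLD


print(is_mechanical("treewide: replace foo with bar", "A Dev", 3))
print(is_mechanical("net: silan: fix rx overflow", "A Dev", 2))
```

Because the criteria are OR-ed, adding a pattern can only move commits from substantive to mechanical, never the other way.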

Driver-entry-point markers

Several legacy drivers don’t use the modern module_<bus>_driver() macros — they declare a struct platform_driver and call platform_driver_register() from an __init function. A naive regex would falsely classify these as non-driver content.

DRIVER_MARKER_RE covers both forms:

module_init / module_exit / module_driver
module_(pci|platform|i2c|spi|usb|...)_(driver|drvdata|device)
platform_driver_register / pci_register_driver / usb_register_driver / ...
struct (platform|pci|usb|i2c|spi|mdio|mhi|scsi)_driver \w+ =

Empirically, the regex finds a marker in 1,427 of 2,027 candidate directories. The remaining 600 are legitimately non-driver content (routing tables, helper-only dirs, internal test code). Any false negative gets dormancy_score = 0 and is excluded from the shortlist, so the cost is a missed candidate, not bad output.
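
For illustration, the four forms above might be combined into one pattern roughly as follows. This is an approximation, not the real DRIVER_MARKER_RE: the bus list is truncated exactly where the source writes "...", and the names DRIVER_MARKER_SKETCH_RE and has_driver_marker are made up for this sketch.

```python
import re

# Approximation of the four marker forms; bus names are a subset.
DRIVER_MARKER_SKETCH_RE = re.compile(
    r"\bmodule_(?:init|exit|driver)\b"
    r"|\bmodule_(?:pci|platform|i2c|spi|usb)_(?:driver|drvdata|device)\b"
    r"|\b(?:platform_driver_register|pci_register_driver|usb_register_driver)\b"
    r"|\bstruct\s+(?:platform|pci|usb|i2c|spi|mdio|mhi|scsi)_driver\s+\w+\s*="
)


def has_driver_marker(c_source: str) -> bool:
    """Return True if the C source text contains any entry-point marker."""
    return DRIVER_MARKER_SKETCH_RE.search(c_source) is not None


modern = "module_platform_driver(foo_driver);"
legacy = (
    "static struct platform_driver foo_driver = {\n};\n"
    "static int __init foo_init(void)\n"
    "{\n\treturn platform_driver_register(&foo_driver);\n}\n"
)
helper = "int foo_helper(int x) { return x + 1; }"

print(has_driver_marker(modern), has_driver_marker(legacy), has_driver_marker(helper))
```

The legacy sample matches on the struct declaration as well as on the register call, which is the redundancy that makes the check robust to older driver styles.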

The dormancy score

let years_since_touch     = max(0, (now - last_substantive_touch_ts) / SECONDS_PER_YEAR)
let dir_age_years         = max(0, (now - first_touch_ts) / SECONDS_PER_YEAR)
let sub                   = substantive_commits_5y
let age_gate              = 1 if dir_age_years >= 5 else 0

dormancy_score = age_gate
               * log1p(years_since_touch)
               / log1p(1 + sub)

# Forced to 0 for is_mega_subsystem or no driver marker.

In words: a directory is dormant to the extent that it is old (age_gate), its last substantive touch was long ago (log1p(years_since_touch)), and it has had few substantive touches recently (/ log1p(1 + sub)).

The log1p smoothing dampens both numerator and denominator. In the denominator, a single substantive commit in the window lowers the score by a factor of about 1.6 (log1p(2) / log1p(1)) instead of collapsing it. In the numerator, a few extra years since the last substantive touch don’t dominate.

The age_gate is deliberately a hard step rather than a smooth ramp: it eliminates the false positive where a recently split-out sub-directory (e.g. drivers/gpu/drm/amd/display/dc/... was re-organised in 2024) shows up as “1 commit in 5y” and so looks very dormant, when it is actually part of a hyperactive subsystem.
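
The formula translates to Python almost line for line. The sketch below makes two assumptions the source does not state: SECONDS_PER_YEAR is taken as a Julian year, and the caller always supplies a last_substantive_touch_ts (how the real ranker handles a directory with no substantive commit at all is not shown here). The argument names for the two zero-forcing flags are also illustrative.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year


def dormancy_score(now, first_touch_ts, last_substantive_touch_ts,
                   substantive_commits_5y,
                   is_mega_subsystem=False, has_driver_marker=True):
    """Score per the formula above; timestamps are Unix seconds."""
    # Forced to 0 for mega-subsystems and dirs without a driver marker.
    if is_mega_subsystem or not has_driver_marker:
        return 0.0
    dir_age_years = max(0.0, (now - first_touch_ts) / SECONDS_PER_YEAR)
    if dir_age_years < 5:  # age_gate: hard step at 5 years
        return 0.0
    years_since_touch = max(
        0.0, (now - last_substantive_touch_ts) / SECONDS_PER_YEAR
    )
    return math.log1p(years_since_touch) / math.log1p(1 + substantive_commits_5y)


# Worked example: 13.7-year-old dir, last substantive touch 4.6 years ago,
# one substantive commit in the window.
now = 60 * SECONDS_PER_YEAR  # arbitrary reference instant
score = dormancy_score(
    now=now,
    first_touch_ts=now - 13.7 * SECONDS_PER_YEAR,
    last_substantive_touch_ts=now - 4.6 * SECONDS_PER_YEAR,
    substantive_commits_5y=1,
)
print(f"{score:.2f}")
```

With these inputs the score is log1p(4.6) / log1p(2), about 1.57.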

Mega-subsystem blocklist

Listed in phase1_rank.py::MEGA_SUBSYSTEM_PREFIXES:

  • drivers/gpu/drm/amd/, drivers/gpu/drm/i915/, drivers/gpu/drm/xe/, drivers/gpu/drm/nouveau/
  • drivers/net/wireless/realtek/rtw88/ and rtw89/
  • drivers/net/wireless/intel/iwlwifi/
  • drivers/net/wireless/mediatek/mt76/
  • drivers/net/ethernet/intel/
  • drivers/net/ethernet/mellanox/
  • drivers/net/ethernet/broadcom/
  • drivers/nvme/
  • drivers/scsi/megaraid/
  • drivers/md/dm
  • drivers/staging/

Any leaf directory inside one of these prefixes scores 0 and is not probed. Two reasons:

  1. These are very active subsystems. Nearly every leaf would come back as keep. The dossier produced would be uniformly “yes, keep, I see lots of recent activity.”
  2. The codex cost would dominate. A 100-driver foray into drivers/gpu/drm/amd/ could spend ~$25 to confirm what we already know.

drivers/staging/ is blocked specifically because it has its own deprecation process (gradual promotion or removal); the dossier verdict would be redundant with the upstream staging/ maintainers’ decisions.

The blocklist is editable. Adding a prefix is a one-line change. Removing a prefix is also a one-line change; the re-run just costs more.
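
A prefix check of this kind can be sketched in a few lines. The tuple copies the list above; the helper name is_mega_subsystem and the trailing-slash normalisation are assumptions made for this example, not a description of the real code.

```python
MEGA_SUBSYSTEM_PREFIXES = (
    "drivers/gpu/drm/amd/",
    "drivers/gpu/drm/i915/",
    "drivers/gpu/drm/xe/",
    "drivers/gpu/drm/nouveau/",
    "drivers/net/wireless/realtek/rtw88/",
    "drivers/net/wireless/realtek/rtw89/",
    "drivers/net/wireless/intel/iwlwifi/",
    "drivers/net/wireless/mediatek/mt76/",
    "drivers/net/ethernet/intel/",
    "drivers/net/ethernet/mellanox/",
    "drivers/net/ethernet/broadcom/",
    "drivers/nvme/",
    "drivers/scsi/megaraid/",
    "drivers/md/dm",
    "drivers/staging/",
)


def is_mega_subsystem(path: str) -> bool:
    """Return True if the directory falls under a blocked prefix."""
    # Assumption: append "/" so the prefix directory itself also matches.
    probe = path if path.endswith("/") else path + "/"
    return probe.startswith(MEGA_SUBSYSTEM_PREFIXES)


print(is_mega_subsystem("drivers/gpu/drm/amd/display/dc"))
print(is_mega_subsystem("drivers/nvmem"))
```

The trailing slash on most prefixes is what keeps drivers/nvmem from being caught by drivers/nvme/.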

Parent-subsumption

After scoring, the shortlist build walks the ordered list and skips a directory if any of its ancestor paths is already accepted. This guarantees one probe per driver subtree.

Example: in an early run, ranks 4-7 were drivers/comedi/drivers/ni_routing/ni_device_routes, ...ni_route_values, ...tests, and the parent ...ni_routing itself. Without parent-subsumption all four would be probed; with it, only ...ni_routing (the parent) is probed and the children are skipped. Cleaner and ~75% cheaper for that subtree.

The trade-off: when the parent holds multiple drivers, some of which might warrant separate verdicts, the parent dossier averages them. For most directories this is acceptable. Mixed-fate campaigns, where one of N drivers is being removed, are an explicit limitation (see limitations.md).
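
The walk described above can be sketched as follows. The function name build_shortlist and the paths in the example are hypothetical. The sketch implements only the ancestor check as stated, so it subsumes children when the parent ranks ahead of them; whether the real ranker also handles a parent that ranks below its children is not covered here.

```python
from pathlib import PurePosixPath


def build_shortlist(ranked_paths, top_n):
    """Walk paths in rank order; skip any whose ancestor is already accepted."""
    accepted = []
    accepted_set = set()
    for raw in ranked_paths:
        if len(accepted) >= top_n:
            break
        path = PurePosixPath(raw)
        # .parents yields every ancestor directory, nearest first.
        if any(str(parent) in accepted_set for parent in path.parents):
            continue
        accepted.append(str(path))
        accepted_set.add(str(path))
    return accepted


ranked = [
    "drivers/x/parent",
    "drivers/x/parent/a",
    "drivers/y",
    "drivers/x/parent/b",
]
print(build_shortlist(ranked, top_n=3))
```

Comparing whole path components rather than string prefixes means a sibling such as drivers/xy is never mistaken for a child of drivers/x.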

Sample top-20

Running phase 1 on Linux 6.x mainline with --since 2021-04-24 produces (approximately):

rank  score  sub  commits  age_y  last_y  path
   1   4.00    0        0    5.6     0.0  drivers/pinctrl/visconti
   2   4.00    0        0    8.6     0.0  drivers/soc/lantiq
   3   1.56    1        3   13.7     4.6  drivers/media/pci/pluto2
   4   1.55    1        1   14.5     4.5  drivers/net/ethernet/silan
   5   1.52    1        1   20.5     4.3  drivers/rapidio/switches
   6   1.49    1        3   10.0     4.1  drivers/clk/axis
   7   1.23    2        3   13.7     4.5  drivers/media/usb/gspca/gl860
   8   1.05    1        2    8.6     2.2  drivers/phy/lantiq
 ...

Eyeballing this list against known facts: rapidio is a dead-quiet interconnect dating from the late 1990s, gl860 is a long-tail webcam chipset, and lantiq xDSL came out of an Infineon spin-off that Intel later acquired and is now retired. Hits.

The score saturating at 4.0 for the top entries is intentional: those are directories with zero recorded activity in the window and a first touch more than 5 years old. Beyond that point the score cannot distinguish further degrees of dormancy.

How to extend

Adding a new feature to phase 1:

  1. Extend DirFeatures with the new field
  2. Populate it in populate_git_features or populate_first_touch
  3. Optionally fold it into dormancy_score. Be conservative: if a feature isn’t well-calibrated, leave it informational and just let it appear in the dossier static_features.json
  4. Re-run on the full tree (3 minutes), eyeball the top-20 against known cases

Adding a new filter:

  1. Append to NON_DRIVER_SUFFIXES, MECHANICAL_SUBJECT_PATTERNS, BOT_AUTHOR_SUBSTRINGS, or MEGA_SUBSYSTEM_PREFIXES
  2. Re-run and compare the candidate count delta. If the filter excludes a directory you expected to keep, the new entry is too broad