Phase 1: dormancy ranking
The phase-1 ranker is the cost-control front-end of the pipeline. It decides which directories are worth spending codex calls on.
It is implemented in scripts/phase1_rank.py, takes about three minutes on
a kernel checkout with five years of history, and emits a
ranked JSON of every leaf driver directory plus a top-N
shortlist.
What gets ranked
The unit is a leaf directory under drivers/ containing at
least one .c file. There are roughly 2,050 such directories
in current Linux mainline.
A directory is excluded from ranking entirely if any of these hold:
| filter | location in phase1_rank.py | rationale |
|---|---|---|
| asset-suffix path | NON_DRIVER_SUFFIXES | /tests, /include, /dt, /docs, /examples, /uapi, /fixtures, etc. — content directories that are not drivers |
| no driver-entry-point macro | DRIVER_MARKER_RE | the dir’s .c files contain no module_init, module_<bus>_driver(), platform_driver_register(), pci_register_driver(), etc. |
| mega-subsystem prefix | MEGA_SUBSYSTEM_PREFIXES | matches an explicit blocklist (see below) |
After filtering, the ranker emits dormancy scores for every remaining directory. The top-N shortlist is built from this with parent-subsumption applied: if a directory is in the shortlist, none of its descendants enter the shortlist separately.
Per-directory features
| feature | derivation |
|---|---|
| commits_5y | every non-merge commit in the last 5 years that touched a file directly inside this dir (not subdirs) |
| substantive_commits_5y | commits_5y minus mechanical sweeps (subject regex blocklist, bot author blocklist, >50 files in one commit cap) |
| mechanical_commits_5y | the difference |
| first_touch_ts | earliest commit (across all history) touching any file in this dir |
| last_touch_ts | most recent commit (within 5y window) |
| last_substantive_touch_ts | most recent substantive commit |
| unique_authors_5y | distinct authors across substantive commits |
| top_author, top_author_commits | most prolific author and their share |
The “substantive” filter is the heart of the methodology. A
directory with commits_5y=15, substantive_commits_5y=2 is much
more dormant than the raw count suggests — most of the touches
are tree-wide cleanups, not real maintenance. Without this
filter the ranker would systematically underrate the dormancy of quiet drivers.
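As a rough illustration, the feature record could look like the dataclass below. The field names follow the table; the types, the defaults, the `path` field, and the choice to derive `mechanical_commits_5y` as a property are assumptions made for this sketch, not a copy of the real `DirFeatures`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirFeatures:
    """Per-directory features. Names follow the table; types are assumed."""
    path: str                                        # hypothetical field, not in the table
    commits_5y: int = 0
    substantive_commits_5y: int = 0
    first_touch_ts: Optional[int] = None             # assumed to be Unix seconds
    last_touch_ts: Optional[int] = None
    last_substantive_touch_ts: Optional[int] = None
    unique_authors_5y: int = 0
    top_author: Optional[str] = None
    top_author_commits: int = 0

    @property
    def mechanical_commits_5y(self) -> int:
        # "The difference": raw commits minus substantive commits.
        return self.commits_5y - self.substantive_commits_5y


# The example from the text: 15 raw commits, only 2 of them substantive.
example = DirFeatures(path="drivers/example/quiet",
                      commits_5y=15, substantive_commits_5y=2)
print(example.mechanical_commits_5y)  # 13
```

Here 13 of the 15 touches are mechanical, which is why the raw count overstates how actively the directory is maintained.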
Mechanical-commit filter
A commit counts as mechanical if any of:
- subject matches one of MECHANICAL_SUBJECT_PATTERNS: ^treewide:, ^checkpatch:, ^coccinelle:, ^cocci:, ^spdx:, ^dt-bindings:, ^maintainers:, ^license:, ^kbuild:, ^kconfig:, ^fix spelling, ^fix typo, ^scripts/, ^automatic, ^documentation:, etc.
- author matches BOT_AUTHOR_SUBSTRINGS: kernel test robot, syzbot, Stephen Rothwell (linux-next mechanical), etc.
- the commit touched ≥ TREEWIDE_FILE_THRESHOLD (50) files
The blocklist is conservative — it drops obvious sweeps and lets
through anything ambiguous. If a real bug fix is dropped because
its subject starts with Kconfig:, the conservatism is fine: the
verdict it would have nudged toward is keep-annotate rather than
deprecate, and keep-annotate is the safe default.
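The three rules above can be sketched as a single predicate. The lists below contain only the entries quoted in this section (the real ones are longer), and two details are assumptions: matching is case-insensitive (suggested by the `Kconfig:` example) and the file threshold is inclusive.

```python
import re

# Abbreviated: only the entries quoted in the text.
MECHANICAL_SUBJECT_PATTERNS = [
    r"^treewide:", r"^checkpatch:", r"^coccinelle:", r"^cocci:", r"^spdx:",
    r"^dt-bindings:", r"^maintainers:", r"^license:", r"^kbuild:", r"^kconfig:",
    r"^fix spelling", r"^fix typo", r"^scripts/", r"^automatic", r"^documentation:",
]
BOT_AUTHOR_SUBSTRINGS = ["kernel test robot", "syzbot", "stephen rothwell"]
TREEWIDE_FILE_THRESHOLD = 50

# Assumption: subjects are matched case-insensitively.
_SUBJECT_RE = re.compile("|".join(MECHANICAL_SUBJECT_PATTERNS), re.IGNORECASE)


def is_mechanical(subject: str, author: str, files_touched: int) -> bool:
    """Return True if any one of the three mechanical-commit rules fires."""
    if _SUBJECT_RE.search(subject):
        return True
    author_lc = author.lower()
    if any(bot in author_lc for bot in BOT_AUTHOR_SUBSTRINGS):
        return True
    # Assumption: the threshold is inclusive (>= 50 files).
    return files_touched >= TREEWIDE_FILE_THRESHOLD


print(is_mechanical("treewide: use kmalloc_array()", "A Developer", 3))     # True
print(is_mechanical("net: example: fix NULL deref", "A Developer", 2))      # False
```

A commit is substantive exactly when this predicate returns False, so `substantive_commits_5y` is the count of such commits in the window.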
Driver-entry-point markers
Several legacy drivers don’t use the modern module_<bus>_driver()
macros — they declare a struct platform_driver and call
platform_driver_register() from an __init function. A naive
regex would falsely classify these as non-driver content.
DRIVER_MARKER_RE covers both forms:
    module_init / module_exit / module_driver
    module_(pci|platform|i2c|spi|usb|...)_(driver|drvdata|device)
    platform_driver_register / pci_register_driver / usb_register_driver / ...
    struct (platform|pci|usb|i2c|spi|mdio|mhi|scsi)_driver \w+ =
Empirically, the regex finds a marker in 1,427 of 2,027
candidate directories. The remaining ~600 are legitimately
non-driver content (routing tables, helper-only dirs, internal
test code). Any false negative shows up as dormancy_score = 0
and is excluded from the shortlist — the cost is missing a
candidate, not producing bad output.
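A minimal approximation of such a marker check is sketched below. It is not the real `DRIVER_MARKER_RE`: the bus lists elided above are replaced with a generic `\w+`, which makes it looser than the original, and the names `DRIVER_MARKER_SKETCH_RE` and `has_driver_marker` are hypothetical.

```python
import re

# Illustrative only: bus names are wildcarded instead of enumerated.
DRIVER_MARKER_SKETCH_RE = re.compile(
    r"\bmodule_(?:init|exit|driver)\b"                  # classic entry points
    r"|\bmodule_\w+_(?:driver|drvdata|device)\b"        # module_<bus>_driver() family
    r"|\b\w+_(?:driver_register|register_driver)\b"     # explicit registration calls
    r"|\bstruct\s+\w+_driver\s+\w+\s*="                 # struct <bus>_driver foo = {...}
)


def has_driver_marker(source_text: str) -> bool:
    """Return True if the C source text contains any driver-entry-point marker."""
    return DRIVER_MARKER_SKETCH_RE.search(source_text) is not None


modern = "module_platform_driver(foo_driver);"
legacy = "static struct platform_driver foo_driver = {\n\t.probe = foo_probe,\n};"
table_only = "static const int route_table[] = { 1, 2, 3 };"

print(has_driver_marker(modern), has_driver_marker(legacy), has_driver_marker(table_only))
```

The legacy form is caught by the struct-declaration alternative even when the registration call lives in another file.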
The dormancy score
    let years_since_touch = max(0, (now - last_substantive_touch_ts) / SECONDS_PER_YEAR)
    let dir_age_years     = max(0, (now - first_touch_ts) / SECONDS_PER_YEAR)
    let sub               = substantive_commits_5y
    let age_gate          = 1 if dir_age_years >= 5 else 0

    dormancy_score = age_gate
                   * log1p(years_since_touch)
                   / log1p(1 + sub)

    # Forced to 0 for is_mega_subsystem or no driver marker.
In words: a directory is dormant to the extent that it is old
(age_gate), its last substantive touch was long ago
(log1p(years_since_touch)), and it has had few substantive
touches recently (/ log1p(1 + sub)).
The log1p smoothing dampens both numerator and denominator so a
single recent substantive touch on a long-lived dir doesn’t
collapse the score to zero, and a few extra years of total age
don’t dominate.
The age_gate is a hard step rather than a smooth ramp on
purpose: it eliminates the false-positive case where a recent
sub-directory split (e.g. drivers/gpu/drm/amd/display/dc/... was
re-organised in 2024) shows up as “1 commit in 5y, very old” when
it is actually part of a hyperactive subsystem.
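The score definition translates almost directly into Python. The sketch below assumes Unix-second timestamps and a 365.25-day year for `SECONDS_PER_YEAR`. It does not cover directories with no substantive touch at all, nor any saturation cap, because the formula above does not specify either.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed definition


def dormancy_score(now, first_touch_ts, last_substantive_touch_ts, sub,
                   is_mega_subsystem=False, has_driver_marker=True):
    """Direct transcription of the score definition above."""
    if is_mega_subsystem or not has_driver_marker:
        return 0.0  # forced to 0, as in the comment under the formula
    years_since_touch = max(0.0, (now - last_substantive_touch_ts) / SECONDS_PER_YEAR)
    dir_age_years = max(0.0, (now - first_touch_ts) / SECONDS_PER_YEAR)
    age_gate = 1 if dir_age_years >= 5 else 0
    return age_gate * math.log1p(years_since_touch) / math.log1p(1 + sub)


now = 1_800_000_000  # arbitrary reference time for the example


def years_ago(years):
    return now - years * SECONDS_PER_YEAR


# 13.7-year-old dir, one substantive commit, last substantive touch 4.6 years ago.
score = dormancy_score(now, years_ago(13.7), years_ago(4.6), sub=1)
print(round(score, 2))  # 1.57

# A directory younger than 5 years is gated to 0 regardless of activity.
print(dormancy_score(now, years_ago(1.0), years_ago(0.9), sub=1))  # 0.0
```

The first result is ln(5.6) / ln(3) ≈ 1.568. Small differences from scores printed by the tool are expected when the inputs are rounded to one decimal place.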
Mega-subsystem blocklist
Listed in phase1_rank.py::MEGA_SUBSYSTEM_PREFIXES:
- drivers/gpu/drm/amd/, drivers/gpu/drm/i915/, drivers/gpu/drm/xe/, drivers/gpu/drm/nouveau/
- drivers/net/wireless/realtek/rtw88/ and rtw89/
- drivers/net/wireless/intel/iwlwifi/
- drivers/net/wireless/mediatek/mt76/
- drivers/net/ethernet/intel/
- drivers/net/ethernet/mellanox/
- drivers/net/ethernet/broadcom/
- drivers/nvme/
- drivers/scsi/megaraid/
- drivers/md/dm
- drivers/staging/
Any leaf directory inside one of these prefixes scores 0 and is not probed. Two reasons:
- These are very active subsystems. Nearly every leaf is keep. The dossier produced would be uniformly “yes, keep, I see lots of recent activity.”
- The codex cost would dominate. A 100-driver foray into drivers/gpu/drm/amd/ could spend ~$25 to confirm what we already know.
drivers/staging/ is blocked specifically because it has its own
deprecation process (gradual promotion or removal); the dossier
verdict would be redundant with the upstream staging/
maintainers’ decisions.
The blocklist is editable. Adding a prefix is a one-line change. Removing a prefix is also a one-line change; the re-run just costs more because the newly admitted directories get probed.
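The check itself can be as small as a string-prefix test. The tuple below lists the prefixes from this section; using `str.startswith` and normalising to a trailing slash are assumptions about how the match is done, not a description of the real code.

```python
MEGA_SUBSYSTEM_PREFIXES = (
    "drivers/gpu/drm/amd/",
    "drivers/gpu/drm/i915/",
    "drivers/gpu/drm/xe/",
    "drivers/gpu/drm/nouveau/",
    "drivers/net/wireless/realtek/rtw88/",
    "drivers/net/wireless/realtek/rtw89/",
    "drivers/net/wireless/intel/iwlwifi/",
    "drivers/net/wireless/mediatek/mt76/",
    "drivers/net/ethernet/intel/",
    "drivers/net/ethernet/mellanox/",
    "drivers/net/ethernet/broadcom/",
    "drivers/nvme/",
    "drivers/scsi/megaraid/",
    "drivers/md/dm",
    "drivers/staging/",
)


def is_mega_subsystem(path: str) -> bool:
    """Return True if the directory sits under a blocklisted prefix."""
    # Normalise to a trailing slash so "drivers/nvme" also matches "drivers/nvme/".
    normalised = path.rstrip("/") + "/"
    return normalised.startswith(MEGA_SUBSYSTEM_PREFIXES)


print(is_mega_subsystem("drivers/gpu/drm/amd/display/dc"))  # True
print(is_mega_subsystem("drivers/pinctrl/visconti"))        # False
```

Editing the blocklist then amounts to adding or deleting one string in the tuple.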
Parent-subsumption
After scoring, the shortlist build walks the ordered list and skips a directory if any of its ancestor paths is already accepted. This guarantees one probe per driver subtree.
Example: in an early run, ranks 4-7 were
drivers/comedi/drivers/ni_routing/ni_device_routes,
...ni_route_values, ...tests, and the parent
...ni_routing itself. Without parent-subsumption all four
would be probed; with it, only ...ni_routing (the parent) is
probed and the children are skipped. Cleaner and ~75% cheaper for
that subtree.
The trade-off: when the parent contains multiple drivers, some of which might warrant separate verdicts, the parent dossier averages them. For most directories this is acceptable. Mixed-fate campaigns, where one of N drivers is being removed, are an explicit limitation; see limitations.md.
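The shortlist walk can be sketched as follows. Skipping a directory whose ancestor is already accepted is the rule stated above. Dropping already-accepted children when their parent arrives later is an assumption added so that the sketch reproduces the "only the parent is probed" outcome when children outrank the parent. The function name and all paths are hypothetical.

```python
from pathlib import PurePosixPath


def build_shortlist(ranked_paths, top_n):
    """Walk paths in rank order and keep at most one entry per driver subtree."""
    accepted = []
    for raw in ranked_paths:
        if len(accepted) >= top_n:
            break
        path = PurePosixPath(raw)
        # Rule from the text: skip if any ancestor is already accepted.
        if any(kept in path.parents for kept in accepted):
            continue
        # Assumption: a later-ranked parent replaces its already-accepted children.
        accepted = [kept for kept in accepted if path not in kept.parents]
        accepted.append(path)
    return [str(p) for p in accepted]


ranked = [
    "drivers/example/a",
    "drivers/example/parent/child_x",
    "drivers/example/parent/child_y",
    "drivers/example/parent",
    "drivers/example/b",
]
print(build_shortlist(ranked, top_n=10))
# ['drivers/example/a', 'drivers/example/parent', 'drivers/example/b']
```

One consequence of this variant: if the shortlist fills up before the parent is reached, the children stay in. Whether the real implementation behaves that way is not stated here.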
Sample top-20
Running phase 1 on Linux 6.x mainline against --since 2021-04-24
produces (approximately):
    rank  score  sub  commits  age_y  last_y  path
       1   4.00    0        0    5.6     0.0  drivers/pinctrl/visconti
       2   4.00    0        0    8.6     0.0  drivers/soc/lantiq
       3   1.56    1        3   13.7     4.6  drivers/media/pci/pluto2
       4   1.55    1        1   14.5     4.5  drivers/net/ethernet/silan
       5   1.52    1        1   20.5     4.3  drivers/rapidio/switches
       6   1.49    1        3   10.0     4.1  drivers/clk/axis
       7   1.23    2        3   13.7     4.5  drivers/media/usb/gspca/gl860
       8   1.05    1        2    8.6     2.2  drivers/phy/lantiq
     ...
Eyeballing this list against known facts: rapidio is a dead-quiet embedded interconnect from the early 2000s, gl860 is a long-tail webcam chipset, and lantiq was an Infineon spin-off that Intel later acquired, with its xDSL line now retired. Hits.
The score being saturated at 4.0 for the top entries is intentional — those are dirs with zero recorded activity in the window plus first-touch older than 5 years. Anything past the saturation point is at most “more dormant than the score can distinguish”.
How to extend
Adding a new feature to phase 1:
- Extend DirFeatures with the new field
- Populate it in populate_git_features or populate_first_touch
- Optionally fold it into dormancy_score (be conservative: if a feature isn’t well-calibrated, leave it informational and just let it appear in the dossier static_features.json)
- Re-run on the full tree (3 minutes), eyeball the top-20 against known cases
Adding a new filter:
- Append to NON_DRIVER_SUFFIXES, MECHANICAL_SUBJECT_PATTERNS, BOT_AUTHOR_SUBSTRINGS, or MEGA_SUBSYSTEM_PREFIXES
- Re-run and compare the candidate count delta; if the filter excludes a directory you expected to keep, the new entry is too broad