PCIe link lost, device now detached: chasing an invisible I225 ASPM L1 hang
Why btop Detaches My Onboard Ethernet Card
The Intel I225-V at PCI 0a:00.0 had been blacklisted in
/etc/modprobe.d/ for two years. A short comment above the
blacklist line read: "The NIC is broken on Linux". I had bought a USB
2.5G dongle and forgotten about the onboard NIC, the way you do when
there are other problems.
This is the story of what was actually wrong with it.
The hardware in question is a rev 03 die, the B3 stepping Intel shipped specifically to fix the well-known I225-V 2.5G connectivity bug from earlier revisions. That earlier bug shows up as link drops, renegotiations, and noisy dmesg around link state. The signature in this post is different: config space alive, MMIO dead, no link retraining, no AER. If you saw "I225-V" and "2.5G" in the title and assumed the famous one, this is a different defect.
The trigger
Thirty-six hours into uptime, dmesg printed this. The
Modules linked in line, the register dump, and the
?-marked speculative stack frames are omitted; the WARNING,
Comm:, and the real call chain are verbatim:
igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
------------[ cut here ]------------
igc: Failed to read reg 0xc030!
WARNING: CPU: 15 PID: 9903 at drivers/net/ethernet/intel/igc/igc_main.c:7009 igc_rd32+0x9d/0xb0 [igc]
CPU: 15 UID: 1000 PID: 9903 Comm: btop Tainted: G OE 6.18.24-1-lts
RIP: 0010:igc_rd32+0x9d/0xb0 [igc]
Call Trace:
<TASK>
igc_update_stats+0x8a/0x6d0 [igc]
igc_get_stats64+0xa2/0xb0 [igc]
dev_get_stats+0x62/0x1b0
rtnl_fill_stats+0x3b/0x130
rtnl_fill_ifinfo.isra.0+0x894/0x1660
rtnl_dump_ifinfo+0x48f/0x5f0
rtnl_dumpit+0x7c/0x90
netlink_dump+0x173/0x3a0
__netlink_dump_start+0x1ed/0x310
rtnetlink_rcv_msg+0x2a1/0x3e0
netlink_rcv_skb+0x5c/0x110
netlink_unicast+0x288/0x3c0
netlink_sendmsg+0x20d/0x430
__sys_sendto+0x1d3/0x1e0
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x81/0x7d0
The process is identified twice in this block: the kernel's
Comm: btop line, and the netlink path through
netlink_sendmsg → rtnetlink → rtnl_dump_ifinfo. That second
one matters. If btop polled stats from
/proc/net/dev, the call trace would go through
dev_seq_show, not netlink_sendmsg. The trace
is unambiguous evidence that this btop build uses
RTM_GETLINK over netlink for per-interface counters.
So the chain is: btop dumps interface stats → kernel
calls each driver's ndo_get_stats64 →
igc_get_stats64 → igc_update_stats → igc_rd32 → the
register read returns all ones → driver detaches. btop is
the messenger. The device was already gone before btop's
sendto(2) reached the kernel.
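The distinction between the two stats paths can be checked mechanically. This is a minimal sketch, not a tool I ran at the time: the function name `stats_read_path` is hypothetical, and the only symbols it looks for are the ones named above (`rtnl_dump_ifinfo` and `netlink_sendmsg` for netlink, `dev_seq_show` for /proc/net/dev).

```python
def stats_read_path(frames):
    """Classify how interface stats were requested, given kernel stack frames.

    frames: iterable of strings such as "dev_get_stats+0x62/0x1b0".
    Returns "netlink", "procfs", or "unknown".
    """
    # Strip the "+offset/size [module]" suffix so only the symbol remains.
    symbols = {frame.strip().split("+")[0] for frame in frames}
    if "rtnl_dump_ifinfo" in symbols or "netlink_sendmsg" in symbols:
        return "netlink"
    if "dev_seq_show" in symbols:
        return "procfs"
    return "unknown"


# A few frames copied from the trace in this post.
trace = [
    "igc_update_stats+0x8a/0x6d0 [igc]",
    "igc_get_stats64+0xa2/0xb0 [igc]",
    "dev_get_stats+0x62/0x1b0",
    "rtnl_dump_ifinfo+0x48f/0x5f0",
    "netlink_sendmsg+0x20d/0x430",
]
print(stats_read_path(trace))
```

Fed the frames from the WARNING, it reports the netlink path, which is the same conclusion as reading the trace by eye.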
What 0xc030 is
igc_regs.h line 107:
#define IGC_RQDPC(_n) (0x0C030 + ((_n) * 0x40))
RQDPC is the Receive Queue Drop Packet Count. Queue
zero's RQDPC is 0xc030.
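The macro is simple enough to evaluate by hand, but here is the same arithmetic as a sketch. The constant names and the choice to print four queues are mine, for illustration only; the base and stride come straight from the `#define`.

```python
IGC_RQDPC_BASE = 0x0C030   # offset of queue 0's drop counter, from the macro
IGC_RQDPC_STRIDE = 0x40    # distance between consecutive queues' registers


def igc_rqdpc(queue):
    """Return the MMIO offset of the drop-packet counter for one RX queue."""
    return IGC_RQDPC_BASE + queue * IGC_RQDPC_STRIDE


# Four queues is an illustrative choice, not read from the hardware.
for q in range(4):
    print(f"queue {q}: {igc_rqdpc(q):#x}")
```

Queue 0 lands on 0xc030, the register named in the WARNING.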
The reason this specific register matters is that it is the first
MMIO read inside igc_update_stats. The function begins:
void igc_update_stats(struct igc_adapter *adapter)
{
...
if (adapter->link_speed == 0)
return;
if (pci_channel_offline(pdev))
return;
...
rcu_read_lock();
for (i = 0; i < adapter->num_rx_queues; i++) {
struct igc_ring *ring = adapter->rx_ring[i];
u32 rqdpc = rd32(IGC_RQDPC(i)); /* <-- first MMIO */
...
The two early-return checks (link speed and
pci_channel_offline) read kernel state, not the device. The
next thing the function does is rd32(IGC_RQDPC(0)). That is
the read that failed.
This is consistent with the L1-exit story rather than a mid-pass wedge. If the device had serviced a few reads and then locked up, the first failing register would not be 0xc030; it would be the second or third one in the stats walk. That 0xc030 is the very first read points at "device unreachable since the last quiet period, host woke link, link wake did not complete, first MMIO returned all ones."
The detector
igc_rd32 is the function that caught the failure. It has
looked roughly like this since the driver's first kernel:
u32 igc_rd32(struct igc_hw *hw, u32 reg)
{
struct igc_adapter *igc = container_of(hw, struct igc_adapter, hw);
u8 __iomem *hw_addr = READ_ONCE(hw->hw_addr);
u32 value = 0;
if (IGC_REMOVED(hw_addr))
return ~value;
value = readl(&hw_addr[reg]);
/* reads should not return all F's */
if (!(~value) && (!reg || !(~readl(hw_addr)))) {
struct net_device *netdev = igc->netdev;
hw->hw_addr = NULL;
netif_device_detach(netdev);
netdev_err(netdev, "PCIe link lost, device now detached\n");
WARN(pci_device_is_present(igc->pdev),
"igc: Failed to read reg 0x%x!\n", reg);
}
return value;
}
The netdev_err + netif_device_detach block
was added by Sasha Neftin in commit c9a11c23ceb6 on
2018-10-11, near the start of the driver's life. The defensive
WARN keyed on
pci_device_is_present(igc->pdev) was added by Lyude Paul
in commit 94bc1e522b32 on 2019-08-22.
The all-ones check in the middle of the function is the interesting part. A
register reads 0xFFFFFFFF. The driver double-checks by
reading register 0 (the base of BAR0, which is IGC_CTRL)
and verifying it too is all ones. If both are, the device is presumed
gone. The WARN fires only if pci_device_is_present() still
says yes, which means PCIe config space is still answering even though
MMIO is dead.
Config space alive, MMIO dead, no link retrain, no AER. That is not a hot-unplug. That is a device that has hung internally.
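The branch structure of that check is easier to see with the hardware removed. The following is a sketch that mirrors the quoted excerpt as a pure function. The name `classify_read` and its arguments are hypothetical: `ctrl_value` stands in for the confirming read of register 0, and `config_space_present` stands in for `pci_device_is_present()`.

```python
ALL_ONES = 0xFFFFFFFF


def classify_read(reg, value, ctrl_value, config_space_present):
    """Model the decision in the igc_rd32 excerpt.

    Returns (detach, warn):
      detach: the driver would call netif_device_detach()
      warn:   the WARN would fire (config space still answers)
    """
    if value != ALL_ONES:
        return (False, False)
    # Register 0 needs no second opinion; any other register is
    # confirmed by re-reading register 0 and checking it is all ones too.
    confirmed = (reg == 0) or (ctrl_value == ALL_ONES)
    if not confirmed:
        return (False, False)
    return (True, config_space_present)


# This post's case: MMIO dead, config space alive.
print(classify_read(0xC030, ALL_ONES, ALL_ONES, True))
# A real hot-unplug: config space is gone as well, so no WARN.
print(classify_read(0xC030, ALL_ONES, ALL_ONES, False))
# A register that happens to hold all ones while register 0 reads sanely.
print(classify_read(0xC030, ALL_ONES, 0x00000001, True))
```

Only the first case produces both the detach and the WARN, which is the combination that showed up in dmesg.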
What AER said
Nothing.
/sys/bus/pci/devices/0000:0a:00.0/aer_dev_correctable,
aer_dev_nonfatal, and aer_dev_fatal all read
zero, on the I225 itself and on every upstream port from the root port
down. The PCIe core's Advanced Error Reporting subsystem did not log a
single uncorrectable error, correctable error, or link recovery event
for the entire 36 hours leading up to the detach. There was no warning
in dmesg in the five minutes before the all-ones read. The device just
stopped answering MMIO.
If you are debugging similar symptoms, do not grep for AER. There is nothing to find.
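If you would rather confirm the silence than take my word for it, the counters are cheap to check. This is a sketch under one assumption: that each `aer_dev_*` file holds one "name count" pair per line, with the count as the last whitespace-separated token. The sample text below is illustrative, not a capture from my machine, and the helper names are hypothetical.

```python
from pathlib import Path

AER_FILES = ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal")


def parse_aer_stats(text):
    """Parse 'name count' lines into a dict; names may contain spaces."""
    counts = {}
    for line in text.splitlines():
        parts = line.strip().rsplit(None, 1)
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def aer_is_silent(text):
    """True if every counter in one aer_dev_* file is zero."""
    return all(count == 0 for count in parse_aer_stats(text).values())


def device_aer_is_silent(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Check all three AER files for one device, e.g. '0000:0a:00.0'."""
    base = Path(sysfs_root) / bdf
    return all(aer_is_silent((base / name).read_text()) for name in AER_FILES)


# Illustrative sample in the assumed format.
sample = "Receiver Error 0\nBad TLP 0\nTOTAL_ERR_COR 0\n"
print(parse_aer_stats(sample), aer_is_silent(sample))
```

Calling `device_aer_is_silent()` for the I225 and for each upstream port covers the same ground as reading the files by hand.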
The PCIe L1 latency budget
The whole path from root port to the I225 had ASPM L1 enabled at the
moment of failure. Six devices, five links, every
LnkCtl: ASPM L1 Enabled. L1 substates were not in use: the
I225's L1SubCap advertises
ASPM_L1.1+ ASPM_L1.2- (capable of L1.1, not even capable of
L1.2), and L1SubCtl1 shows every substate disabled. The
link was using plain L1. (Worth noting in passing: because the I225 is
not L1.2-capable at all, the I226 L1.2 fix discussed below would be a
no-op for it even if its device-ID guard were widened.)
The numbers that matter come from two capability fields:
| Device | LnkCap: Exit Latency L1 |
|---|---|
| 00:02.1 (root port) | <32us |
| 03:00.0 (switch upstream) | <32us |
| 04:08.0 | <32us |
| 06:00.0 | <32us |
| 07:05.0 (switch port to I225) | <32us |
| 0a:00.0 (I225) | <4us |
| Device | DevCap: L1 Acceptable Latency |
|---|---|
| 0a:00.0 (I225) | <64us |
The "Acceptable Latency" is the maximum delay the endpoint will tolerate between issuing a PM request and the link being back in L0. Software is supposed to confirm the endpoint's experienced L1 exit latency stays under that figure before enabling L1.
The subtlety is how "experienced latency" is computed, and it is not
a sum. Linux's pcie_aspm_check_latency() walks the path
link by link and, for each link, compares that link's own L1 exit
latency (the larger of the two directions) plus a fixed 1µs-per-switch
allowance against the endpoint's acceptable latency:
latency = max_t(u32, latency_up_l1, latency_dw_l1);
if ((link->aspm_capable & PCIE_LINK_STATE_L1) &&
(latency + l1_switch_latency > acceptable_l1))
link->aspm_capable &= ~PCIE_LINK_STATE_L1;
l1_switch_latency += NSEC_PER_USEC;
l1_switch_latency is the accumulator: it starts at zero
and the last line bumps it by one microsecond
(NSEC_PER_USEC) per link as the loop walks toward the root
port. The kernel's own comment states the model outright: "Every switch
on the path to root complex need 1 more microsecond for L1." Nothing
adds the 32µs links together.
Run the I225's numbers through that model and L1 passes comfortably.
Each link's exit latency is max(...) = 32µs, and the
per-switch allowance grows 0, 1, 2, 3, 4µs as the walk reaches the root
port, so the worst link is checked as roughly 36µs against the 64µs
budget. Every link clears it. The kernel enabled L1 here by ordinary
policy, not by any override and not in violation of its own check.
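The per-link check is short enough to replay outside the kernel. This is a sketch of the model as described above, not the kernel function itself: it works in whole microseconds rather than nanoseconds, the name `check_l1_latency` is mine, and the link list uses the figures from the tables in this post (five links, worst direction 32µs each, 64µs acceptable).

```python
def check_l1_latency(links_us, acceptable_us):
    """Replay the per-link L1 check, walking from the endpoint to the root.

    links_us: list of (upstream_exit_us, downstream_exit_us), endpoint link first.
    Returns a list of (checked_latency_us, passes) per link.
    """
    results = []
    switch_latency = 0  # grows by 1us per link, like l1_switch_latency
    for up, dw in links_us:
        latency = max(up, dw)              # worse of the two directions
        checked = latency + switch_latency
        # The kernel clears L1 only when the checked value exceeds the budget.
        results.append((checked, checked <= acceptable_us))
        switch_latency += 1
    return results


# Endpoint link: the I225 side advertises <4us, the switch port side <32us.
links = [(4, 32)] + [(32, 32)] * 4
results = check_l1_latency(links, acceptable_us=64)
for checked, ok in results:
    print(f"{checked}us vs 64us -> {'L1 allowed' if ok else 'L1 cleared'}")
```

The largest value checked is 36µs. No step adds the 32µs figures together, which is why the path clears a 64µs budget.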
(The I225 does advertise ASPMOptComp+. That is the ASPM
Optionality Compliance bit, "I function correctly with ASPM on or off",
not a latency-budget bypass. aspm.c never reads it to
enable L1 over budget, and on this path it did not need to.)
That is the actual trap. The budget the kernel checks is a model, and on this topology the model is too optimistic. The physical L1 exit does not run across one 32µs link; it runs across five serialized hops with a retimer in the path, re-establishing each link in turn. The 1µs-per-switch allowance is a heuristic from an era of shallower topologies, and it under-counts what a deep retimer'd path costs to wake. Nothing in the modeled budget, kernel or spec, would have flagged this configuration. That is exactly why L1 was on. And because the failure is in the exit handshake rather than in any transaction, AER never sees it.
I have not put the L1 exit on an oscilloscope, so I cannot point at the microsecond where the silicon gives up. What I can say is concrete and checkable: the path's modeled exit latency clears the endpoint's budget, so L1 is enabled by normal policy; the physical exit on a five-hop retimer'd path is plausibly far larger than the model assumes; the failure mode is "first MMIO after a long quiet period returns all ones"; and AER is silent throughout. That is consistent with an L1-exit handshake that does not complete on this topology. It is not a proof.
What the driver was doing about it
$ git grep -n pci_disable_link_state v6.18 -- drivers/net/ethernet/intel/igc/igc_main.c
7161: pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
7548: pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
7676: pci_disable_link_state_locked(pdev, PCIE_LINK_STATE_L1_2);
All three sites are guarded by
igc_is_device_id_i226(hw). They only fire for the newer
I226. They only ever disable PCIE_LINK_STATE_L1_2, the
deepest substate. Plain L1, the level my hang happens at, is never
disabled for any device ID.
The relevant commit is
0325143b59c6 igc: disable L1.2 PCI-E link substate to avoid performance issue,
landed in v6.16 (2025-07-01), with this comment:
I226 devices advertise support for the PCI-E link L1.2 substate. However, due to a hardware limitation, the exit latency from this low-power state is longer than the packet buffer can tolerate under high traffic conditions. This can lead to packet loss and degraded performance.
That fix is for an I226 packet-loss problem, not an I225 catastrophic-hang problem. The I225 is left alone.
For comparison, r8169_main.c has this in its probe path,
rtl_init_one():
/* Disable ASPM L1 as that cause random device stop working
* problems as well as full system hangs for some PCIe devices users.
*/
if (rtl_aspm_is_safe(tp)) {
dev_info(&pdev->dev, "System vendor flags ASPM as safe\n");
rc = 0;
} else {
rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1);
}
That comment matches my symptoms almost exactly: "random device stop
working" and "full system hangs". r8169 disables ASPM L1 by default at
probe and only re-enables when the system vendor has set a specific MAC
OCP register flag through rtl_aspm_is_safe() to certify the
board.
drivers/bluetooth/hci_bcm4377.c is a second precedent,
though it disables ASPM more broadly. It turns off both L0s and L1 for a
documented hardware erratum, and is emphatic enough to clear the LnkCtl
bits directly in case pci_disable_link_state is
refused:
static void bcm4377_disable_aspm(struct bcm4377_data *bcm4377)
{
pci_disable_link_state(bcm4377->pdev,
PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1);
/* ... We must *always* disable ASPM for this device due to
* hardware errata though. */
pcie_capability_clear_word(bcm4377->pdev, PCI_EXP_LNKCTL, ...);
}
The distinction worth keeping is which knob each driver reaches for.
Two disable PCIE_LINK_STATE_L1 specifically, which is the
exact knob my workaround uses: r8169, and dwmac-motorcomm
(the stmmac glue for Motorcomm's YT6801, whose comment reads "let's
disable L1 state unconditionally for safety"). bcm4377 disables ASPM
more broadly, L0s and L1 together, for its documented erratum. The rest
include L1 only as part of a wider blanket policy:
hpsa, mpt3sas, aacraid, and
jme use L0S | L1 | CLKPM;
alcor_pci uses L0S | L1. Those blanket cases
are weaker analogies, because they read as "we never want ASPM on this
class of card", not "this silicon's L1 exit is broken." The cleanest
precedents, an Ethernet driver disabling L1 alone because L1
specifically misbehaves, are r8169 and dwmac-motorcomm.
What igc has for I225 in 2025 is nothing.
Trying to reproduce on demand
A passive 36-hour soak is a poor diagnostic instrument. I wrote a script that toggled the link in and out of L1 as aggressively as possible:
- Short idle 1 to 5 seconds, then ethtool -S (which reads IGC_RQDPC(0), exactly the register that failed)
- Long idle 60 to 300 seconds, then ethtool -S
- Short idle, then 50 register reads back-to-back
The script ran for an hour. 151 iterations. ASPM L1 was confirmed enabled the whole time. AER stayed at zero. No detach.
The natural failure happened to a process that polled stats roughly every two seconds for 36 hours, which works out to about 65,000 register-read events before the hang. My 151 iterations are roughly 1/430 of that count. The arithmetic suggests the failure rate per L1 entry/exit cycle is very low. The bug exists, but it does not yield to one hour of aggressive cycling.
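The exposure comparison is plain arithmetic, sketched here so the numbers can be rechecked. The two-second poll interval is my rough estimate of btop's refresh rate, as stated above, not a measured value.

```python
uptime_hours = 36
poll_interval_s = 2          # rough estimate of btop's stats refresh period
script_iterations = 151      # iterations the provoke script completed in an hour

# Stats polls (each starting with a read of IGC_RQDPC(0)) before the hang.
natural_reads = uptime_hours * 3600 // poll_interval_s

# How many times more exposure the natural run had than the script.
ratio = natural_reads / script_iterations

print(natural_reads, round(ratio))
```

One failure in tens of thousands of polls is not something 151 iterations can be expected to hit.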
The workaround
The candidate workaround is one sysfs file:
# /etc/tmpfiles.d/igc-aspm-disable.conf
w /sys/bus/pci/devices/0000:0a:00.0/link/l1_aspm - - - - 0
l1_aspm is the gate for plain L1. Disabling it
implicitly disables the substates too (no L1, no L1.1, no L1.2), which
lines up with what my lspci -vvv already showed before the
workaround: L1SubCtl1 had
ASPM_L1.1- ASPM_L1.2- even with plain L1 enabled. The other
knobs (l1_1_aspm, l1_1_pcipm) are redundant on
this hardware.
Two things to confirm after applying it:
- lspci -vvv -s 0a:00.0 should now show LnkCtl: ASPM Disabled where it previously showed ASPM L1 Enabled. Firmware can retain ASPM control and silently ignore the sysfs write; verify that the bit actually flipped.
- systemd-tmpfiles-setup.service runs early in boot, but it is not the first thing the kernel does after probing the device. The exposure window between PCI probe and tmpfiles-setup is on the order of a few seconds, not zero. The bug's per-trigger rate is low enough that a few seconds does not matter in practice, but if you want a tighter binding, use a udev rule keyed on the device:
# /etc/udev/rules.d/99-igc-aspm.rules
SUBSYSTEM=="pci", ACTION=="add", KERNEL=="0000:0a:00.0", ATTR{link/l1_aspm}="0"
The udev rule runs the moment the kernel adds the device, before any other userspace touches it.
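Checking that the bit actually flipped can be scripted. This is a minimal sketch: the helper names are hypothetical, the sample lines are illustrative rather than captured, and it assumes the `LnkCtl:` line is followed by `ASPM <state>;` as in the strings quoted above. lspci usually needs root to print the capability block at all.

```python
import re
import subprocess


def lnkctl_aspm_state(lspci_text):
    """Extract the ASPM state from the LnkCtl line of `lspci -vvv` output.

    Returns e.g. "L1 Enabled" or "Disabled", or None if no LnkCtl line is found.
    """
    match = re.search(r"LnkCtl:\s*ASPM ([^;\n]+)", lspci_text)
    return match.group(1).strip() if match else None


def live_aspm_state(bdf="0a:00.0"):
    """Run lspci for one device and return its LnkCtl ASPM state."""
    out = subprocess.run(
        ["lspci", "-vvv", "-s", bdf],
        capture_output=True, text=True, check=True,
    ).stdout
    return lnkctl_aspm_state(out)


# Illustrative sample lines in the assumed format.
before = "\t\tLnkCtl:\tASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+"
after = "\t\tLnkCtl:\tASPM Disabled; RCB 64 bytes, Disabled- CommClk+"
print(lnkctl_aspm_state(before), "->", lnkctl_aspm_state(after))
```

If `live_aspm_state()` still reports `L1 Enabled` after the write, the sysfs change did not take and the workaround is not in effect.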
What is still uncertain
The reproduction work has been ongoing since the original detach. With ASPM L1 enabled and the NIC linked at 1Gbps to a known-good switch port, the provoke script has run for the equivalent of 17 days across multiple boots, totalling about 23,000 trigger events. Zero detaches.
Naively that sounds like "1Gbps is safe". Statistically it is not even close. The natural rate I measured was 1 event in ~65,000 trigger cycles. Under that same rate, the expected number of events in 23,000 cycles is about 0.35, and the probability of seeing zero is e^(-0.35) ≈ 0.70. In other words: if 1Gbps and 2.5Gbps had identical per-cycle failure rates, I would still see zero failures in 23,000 cycles about 70% of the time. The 1G run is underpowered to detect a difference, not evidence of one.
To actually distinguish the two link speeds, I either need a second natural-or-provoked detach at 2.5Gbps to tighten the rate estimate, or I need enough 1Gbps cycles that zero becomes improbable under the 2.5Gbps rate. That second threshold is around 200,000 cycles for a 95% lower bound.
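Both figures come from treating detaches as rare independent events, so the count in n cycles is approximately Poisson. This is a sketch of that calculation. The one-in-65,000 rate is itself an estimate from a single event, so the outputs are only as good as that input.

```python
import math

natural_cycles_per_failure = 65_000   # one detach in ~65,000 trigger cycles
observed_cycles = 23_000              # 1Gbps provoke cycles with zero detaches

# Expected detaches in the 1Gbps run if the per-cycle rate were the same.
expected = observed_cycles / natural_cycles_per_failure

# Poisson probability of seeing zero events given that expectation.
p_zero = math.exp(-expected)

# Cycles needed before a zero-detach run has only a 5% chance under that rate:
# solve exp(-n / 65000) = 0.05 for n.
cycles_for_95 = natural_cycles_per_failure * math.log(1 / 0.05)

print(f"expected={expected:.2f} p_zero={p_zero:.2f} cycles_for_95={cycles_for_95:,.0f}")
```

The threshold comes out a little under 200,000 cycles, roughly eight times the exposure collected so far.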
The clean causal test for ASPM is also still pending, and the honest evidence is weaker than I would like. Every run since the original detach:
- Detach at 2.5Gbps, organic load, ASPM L1 enabled. One event. This is the only time the NIC has run at 2.5Gbps, and it failed.
- No detach at 1Gbps, ASPM L1 enabled, ~23,000 provoke cycles.
- No detach at 1Gbps, ASPM L1 disabled (the candidate workaround), weeks of uptime.
Read those together and the problem is stark. Both stable arms are at 1Gbps, and at 1Gbps the NIC does not detach whether L1 is on or off. So the 1G data cannot tell me whether disabling L1 does anything; 1G simply never fails. And the only 2.5Gbps run I have detached. I have never observed this card running stably at its rated 2.5Gbps under any configuration, with or without the workaround.
That makes the current state honest but unsatisfying: the machine has a working wired link only because it is running a 2.5G NIC at 1G. So far the "workaround" is indistinguishable from "I stopped using the speed that fails." Disabling L1 is well-motivated by the latency-model argument above and is the right thing to test, but I have not earned the claim that it fixes anything. The test still owed: reproduce at 2.5Gbps with L1 enabled, then equal exposure at 2.5Gbps with L1 disabled. Until that runs, L1-disable is a hypothesis with a plausible mechanism, not a demonstrated fix.
What I can say plainly:
- A candidate workaround exists, is small, and lives entirely in userspace. Whether it actually prevents the detach at 2.5Gbps is not yet shown; every clean run so far has been at 1G, where the bug does not bite regardless.
- The driver has no plain-L1 defensive code for I225 device IDs. The pattern of disabling PCIE_LINK_STATE_L1 specifically for a known defect already exists in r8169 and dwmac-motorcomm; bcm4377 disables L0s and L1 together for its erratum.
- The failure leaves no AER trace. Anyone debugging similar symptoms by grepping for PCIe errors will find nothing.
- The kernel enabled L1 by ordinary policy: its modeled exit latency for this path (per-link exit + 1µs/switch, not a sum) clears the I225's 64µs budget. The defect is that the model under-counts a deep retimer'd path's real exit latency. The budget being satisfied is why nothing stopped L1, not evidence the path is safe.
igc_rd32's all-ones detector is the only line of defence the driver has. The WARN it prints names the failing register but does not name the cause. The cause is two lines earlier in dmesg, in a single line saying "PCIe link lost", and even that line is the consequence rather than the trigger.
The blacklist line in /etc/modprobe.d/ is gone. What
replaced it is not a fix. The card carries traffic because it is pinned
to 1G, the one speed that has never failed; at its rated 2.5G I have a
single run and a single detach. The real question, whether an I225-V on
this board can run reliably at 2.5G with ASPM L1 disabled, is still
open. I have a plausible mechanism, a one-line lever to test it, and not
yet the evidence to claim it works. That test, and the second 2.5G
detach that would anchor it, is the next post.
References and further reading
- Intel Ethernet Controller I225 product brief (PDF): the part, its steppings, and the 2.5G feature set.
- My Arch Linux forum thread on this NIC: the running discussion, with the dmesg and configuration details as they were gathered.
- igc: disable L1.2 PCI-E link substate to avoid performance issue (0325143b59c6): the I226-only L1.2 fix discussed above.
- igc_rd32 and the all-ones detach detector (igc_main.c): the function that prints "PCIe link lost, device now detached".
- pcie_aspm_check_latency() (drivers/pci/pcie/aspm.c): the per-link L1 exit latency check with the "1 more microsecond per switch" model.
- r8169's default ASPM L1 disable (r8169_main.c): the precedent, an Ethernet driver that disables L1 unless the board is certified safe.
- Kernel PCIe ASPM documentation: pcie_aspm, pcie_aspm.policy, and the per-device link/l1_aspm sysfs knobs.
Hardware: ASUS ROG STRIX X670E-E, onboard I225-V rev 03 (B3
stepping), subsystem 1043:87d2, BIOS 3402. Kernel: Linux
6.18.24-1-lts and forward. NVM/firmware: 1082:8770. Path:
AMD root port → internal PCIe switch (multi-hop, retimer in the segment
to 0a:00.0) → I225. The dmesg block is from the running 6.18.24-1-lts
kernel (hence igc_main.c:7009). Source quotes and the grep
line numbers are against the v6.18 release tag; absolute
line numbers drift by a handful across the stable series and mainline,
so the function names and commit hashes are the durable references. The
pcie_aspm_check_latency() excerpt is byte-identical in
v6.18 and current mainline (v7.1).