PCIe link lost, device now detached: chasing an invisible I225 ASPM L1 hang

Why btop Detaches My Onboard Ethernet Card

The Intel I225-V at PCI 0a:00.0 had been blacklisted in /etc/modprobe.d/ for two years. A short comment above the blacklist line read: "The NIC is broken on Linux". I had bought a USB 2.5G dongle and forgotten about the onboard NIC, the way you do when there are other problems.

This is the story of what was actually wrong with it.

The hardware in question is a rev 03 die, the B3 stepping Intel shipped specifically to fix the well-known I225-V 2.5G connectivity bug from earlier revisions. That earlier bug shows up as link drops, renegotiations, and noisy dmesg around link state. The signature in this post is different: config space alive, MMIO dead, no link retraining, no AER. If you saw "I225-V" and "2.5G" in the title and assumed the famous one, this is a different defect.

The trigger

Thirty-six hours into uptime, dmesg printed this. The Modules linked in line, the register dump, and the ?-marked speculative stack frames are omitted; the WARNING, Comm:, and the real call chain are verbatim:

igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
------------[ cut here ]------------
igc: Failed to read reg 0xc030!
WARNING: CPU: 15 PID: 9903 at drivers/net/ethernet/intel/igc/igc_main.c:7009 igc_rd32+0x9d/0xb0 [igc]
CPU: 15 UID: 1000 PID: 9903 Comm: btop  Tainted: G OE  6.18.24-1-lts
RIP: 0010:igc_rd32+0x9d/0xb0 [igc]
Call Trace:
 <TASK>
 igc_update_stats+0x8a/0x6d0 [igc]
 igc_get_stats64+0xa2/0xb0 [igc]
 dev_get_stats+0x62/0x1b0
 rtnl_fill_stats+0x3b/0x130
 rtnl_fill_ifinfo.isra.0+0x894/0x1660
 rtnl_dump_ifinfo+0x48f/0x5f0
 rtnl_dumpit+0x7c/0x90
 netlink_dump+0x173/0x3a0
 __netlink_dump_start+0x1ed/0x310
 rtnetlink_rcv_msg+0x2a1/0x3e0
 netlink_rcv_skb+0x5c/0x110
 netlink_unicast+0x288/0x3c0
 netlink_sendmsg+0x20d/0x430
 __sys_sendto+0x1d3/0x1e0
 __x64_sys_sendto+0x24/0x30
 do_syscall_64+0x81/0x7d0

The process is identified twice in this block: the kernel's Comm: btop line, and the netlink path through netlink_sendmsg → rtnetlink → rtnl_dump_ifinfo. That second one matters. If btop polled stats from /proc/net/dev, the call trace would go through dev_seq_show, not netlink_sendmsg. The trace is unambiguous evidence that this btop build uses RTM_GETLINK over netlink for per-interface counters.

So the chain is: btop dumps interface stats → kernel calls each driver's ndo_get_stats64 (for igc, igc_get_stats64) → igc_update_stats → igc_rd32 → the register read returns all ones → driver detaches. btop is the messenger. The device was already gone before sendmsg(2) reached the kernel.
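The same reasoning can be applied mechanically to any saved trace. This is a minimal sketch, not a general-purpose tool: the function name stats_path is my own, and it only knows the two frame names discussed here (dev_seq_show for the /proc/net/dev path, the rtnl_* frames for the netlink path).

```python
def stats_path(call_trace: str) -> str:
    """Guess how userspace requested interface counters, from a kernel call trace.

    Only distinguishes the two paths discussed in the post; anything else
    is reported as "unknown".
    """
    # Strip the "+0x.../0x..." offset suffix so only the symbol name remains.
    symbols = {
        line.strip().split("+")[0]
        for line in call_trace.splitlines()
        if line.strip()
    }
    if "dev_seq_show" in symbols:
        return "procfs"          # a read of /proc/net/dev
    if "rtnl_dump_ifinfo" in symbols or "rtnl_fill_ifinfo.isra.0" in symbols:
        return "netlink"         # an RTM_GETLINK dump
    return "unknown"


if __name__ == "__main__":
    trace = """
     igc_get_stats64+0xa2/0xb0 [igc]
     dev_get_stats+0x62/0x1b0
     rtnl_fill_stats+0x3b/0x130
     rtnl_dump_ifinfo+0x48f/0x5f0
     netlink_sendmsg+0x20d/0x430
    """
    print(stats_path(trace))
```

Fed the frames from the dmesg block above, it reports the netlink path.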

What 0xc030 is

igc_regs.h line 107:

#define IGC_RQDPC(_n)    (0x0C030 + ((_n) * 0x40))

RQDPC is the Receive Queue Drop Packet Count. Queue zero's RQDPC is 0xc030.
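For mapping an offset from a WARN back to a queue, the macro is easy to mirror outside the kernel. A small sketch: the base and stride come from the #define above, while the default of four RX queues is my assumption about this part and should be adjusted if your adapter reports otherwise.

```python
IGC_RQDPC_BASE = 0x0C030    # from igc_regs.h, as quoted above
IGC_RQDPC_STRIDE = 0x40


def igc_rqdpc(n: int) -> int:
    """Register offset of the Receive Queue Drop Packet Count for queue n."""
    return IGC_RQDPC_BASE + n * IGC_RQDPC_STRIDE


def queue_for_offset(reg: int, num_queues: int = 4):
    """Return the queue index whose RQDPC lives at reg, or None.

    num_queues=4 is an assumption, not something read from the device.
    """
    for n in range(num_queues):
        if igc_rqdpc(n) == reg:
            return n
    return None


if __name__ == "__main__":
    print(hex(igc_rqdpc(0)), queue_for_offset(0xC030))
```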

The reason this specific register matters is that it is the first MMIO read inside igc_update_stats. The function begins:

void igc_update_stats(struct igc_adapter *adapter)
{
    ...
    if (adapter->link_speed == 0)
        return;
    if (pci_channel_offline(pdev))
        return;
    ...
    rcu_read_lock();
    for (i = 0; i < adapter->num_rx_queues; i++) {
        struct igc_ring *ring = adapter->rx_ring[i];
        u32 rqdpc = rd32(IGC_RQDPC(i));         /* <-- first MMIO */
        ...

The two early-return checks (link speed and pci_channel_offline) read kernel state, not the device. The next thing the function does is rd32(IGC_RQDPC(0)). That is the read that failed.

This is consistent with the L1-exit story rather than a mid-pass wedge. If the device had serviced a few reads and then locked up, the first failing register would not be 0xc030; it would be the second or third one in the stats walk. That 0xc030 is the very first read points at "device unreachable since the last quiet period, host woke link, link wake did not complete, first MMIO returned all ones."

The detector

igc_rd32 is the function that caught the failure. It has looked roughly like this since the driver's first kernel:

u32 igc_rd32(struct igc_hw *hw, u32 reg)
{
    struct igc_adapter *igc = container_of(hw, struct igc_adapter, hw);
    u8 __iomem *hw_addr = READ_ONCE(hw->hw_addr);
    u32 value = 0;

    if (IGC_REMOVED(hw_addr))
        return ~value;

    value = readl(&hw_addr[reg]);

    /* reads should not return all F's */
    if (!(~value) && (!reg || !(~readl(hw_addr)))) {
        struct net_device *netdev = igc->netdev;

        hw->hw_addr = NULL;
        netif_device_detach(netdev);
        netdev_err(netdev, "PCIe link lost, device now detached\n");
        WARN(pci_device_is_present(igc->pdev),
             "igc: Failed to read reg 0x%x!\n", reg);
    }

    return value;
}

The netdev_err + netif_device_detach block was added by Sasha Neftin in commit c9a11c23ceb6 on 2018-10-11, near the start of the driver's life. The defensive WARN keyed on pci_device_is_present(igc->pdev) was added by Lyude Paul in commit 94bc1e522b32 on 2019-08-22.

The all-ones check is the interesting part. A register reads 0xFFFFFFFF. Unless the failing read was itself register 0, the driver double-checks by reading register 0 (the base of BAR0, which is IGC_CTRL) and verifying it too is all ones. If both are, the device is presumed gone. The WARN fires only if pci_device_is_present() still says yes, which means PCIe config space is still answering even though MMIO is dead.

Config space alive, MMIO dead, no link retrain, no AER. That is not a hot-unplug. That is a device that has hung internally.
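The decision logic is small enough to model in a few lines. This is only a userspace mirror of the condition in igc_rd32 as quoted above, with the BAR0 read and the config-space presence check passed in as plain values; the function names and the result strings are mine, not the driver's.

```python
ALL_ONES = 0xFFFFFFFF


def looks_detached(value: int, reg: int, read_ctrl) -> bool:
    """Mirror of: !(~value) && (!reg || !(~readl(hw_addr)))

    read_ctrl is a callable standing in for the second readl() of register 0.
    It is only invoked when the first read was all ones and reg is non-zero,
    matching the short-circuit order of the C expression.
    """
    if value != ALL_ONES:
        return False
    if reg == 0:
        return True
    return read_ctrl() == ALL_ONES


def classify(value: int, reg: int, read_ctrl, config_space_present: bool) -> str:
    """Label the outcome the way the post reads it."""
    if not looks_detached(value, reg, read_ctrl):
        return "ok"
    if config_space_present:
        return "detached, WARN: config space alive, MMIO dead"
    return "detached, no WARN: device absent"


if __name__ == "__main__":
    # The case from the dmesg block: 0xc030 and register 0 both read all ones,
    # while pci_device_is_present() still returns true.
    print(classify(ALL_ONES, 0xC030, lambda: ALL_ONES, True))
```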

What AER said

Nothing.

/sys/bus/pci/devices/0000:0a:00.0/aer_dev_correctable, aer_dev_nonfatal, and aer_dev_fatal all read zero, on the I225 itself and on every upstream port from the root port down. The PCIe core's Advanced Error Reporting subsystem did not log a single uncorrectable error, correctable error, or link recovery event for the entire 36 hours leading up to the detach. There was no warning in dmesg in the five minutes before the all-ones read. The device just stopped answering MMIO.

If you are debugging similar symptoms, do not grep for AER. There is nothing to find.
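If you want to confirm the same silence on your own path rather than take my word for it, the counters can be collected in one pass. A hedged sketch: it assumes the aer_dev_* files contain one "name count" pair per line plus TOTAL_ lines, and that walking up the resolved sysfs path visits each upstream port; the helper names are mine.

```python
import re
from pathlib import Path

AER_FILES = ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal")
BDF = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def parse_aer(text: str) -> dict:
    """Parse 'name count' lines from an aer_dev_* file into a dict."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def path_to_root(bdf: str, sysfs: Path = Path("/sys/bus/pci/devices")) -> list:
    """Endpoint first, then each parent directory that is named like a PCI device."""
    node = (sysfs / bdf).resolve()
    chain = []
    while BDF.match(node.name):
        chain.append(node)
        node = node.parent
    return chain


def aer_report(bdf: str) -> dict:
    """Sum the per-error counters (ignoring TOTAL_ lines) for every device on the path."""
    report = {}
    for dev in path_to_root(bdf):
        for name in AER_FILES:
            f = dev / name
            if f.exists():
                counts = parse_aer(f.read_text())
                report[(dev.name, name)] = sum(
                    v for k, v in counts.items() if not k.startswith("TOTAL_")
                )
    return report


if __name__ == "__main__":
    for key, total in aer_report("0000:0a:00.0").items():
        print(key, total)
```

On this machine every entry was zero, which is the point: an empty report does not clear the device.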

The PCIe L1 latency budget

The whole path from root port to the I225 had ASPM L1 enabled at the moment of failure. Six devices, five links, every LnkCtl: ASPM L1 Enabled. L1 substates were not in use: the I225's L1SubCap advertises ASPM_L1.1+ ASPM_L1.2- (capable of L1.1, not even capable of L1.2), and L1SubCtl1 shows every substate disabled. The link was using plain L1. (Worth noting in passing: because the I225 is not L1.2-capable at all, the I226 L1.2 fix discussed below would be a no-op for it even if its device-ID guard were widened.)

The numbers that matter come from two capability fields:

Device                        LnkCap: Exit Latency L1
00:02.1 (root port)           <32us
03:00.0 (switch upstream)     <32us
04:08.0                       <32us
06:00.0                       <32us
07:05.0 (switch port to I225) <32us
0a:00.0 (I225)                <4us

Device                        DevCap: L1 Acceptable Latency
0a:00.0 (I225)                <64us

The "Acceptable Latency" is the maximum delay the endpoint will tolerate between issuing a PM request and the link being back in L0. Software is supposed to confirm the endpoint's experienced L1 exit latency stays under that figure before enabling L1.

The subtlety is how "experienced latency" is computed, and it is not a sum. Linux's pcie_aspm_check_latency() walks the path link by link and, for each link, compares that link's own L1 exit latency (the larger of the two directions) plus a fixed 1µs-per-switch allowance against the endpoint's acceptable latency:

latency = max_t(u32, latency_up_l1, latency_dw_l1);
if ((link->aspm_capable & PCIE_LINK_STATE_L1) &&
    (latency + l1_switch_latency > acceptable_l1))
        link->aspm_capable &= ~PCIE_LINK_STATE_L1;
l1_switch_latency += NSEC_PER_USEC;

l1_switch_latency is the accumulator: it starts at zero and the last line bumps it by one microsecond (NSEC_PER_USEC) per link as the loop walks toward the root port. The kernel's own comment states the model outright: "Every switch on the path to root complex need 1 more microsecond for L1." Nothing adds the 32µs links together.

Run the I225's numbers through that model and L1 passes comfortably. Each link's exit latency is max(...) = 32µs, and the per-switch allowance grows 0, 1, 2, 3, 4µs as the walk reaches the root port, so the worst link is checked as roughly 36µs against the 64µs budget. Every link clears it. The kernel enabled L1 here by ordinary policy, not by any override and not in violation of its own check.
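The arithmetic is easier to trust when it is spelled out. This is a simplified model of the loop excerpted above, not the kernel function: it works in whole microseconds, takes the five links at 32µs each as I read them off lspci, and treats the "<32us" encodings as their upper bounds. The summing comparison at the end is a hypothetical alternative, shown only for contrast.

```python
def check_l1(link_exit_us, acceptable_us):
    """Walk links from the endpoint toward the root port.

    link_exit_us: per-link L1 exit latency in microseconds, already reduced to
    the larger of the two directions. Returns (checked_latency, passes) per link.
    """
    results = []
    switch_allowance = 0                      # l1_switch_latency, in microseconds
    for exit_us in link_exit_us:
        checked = exit_us + switch_allowance
        # The kernel clears L1 when checked > acceptable, so <= passes.
        results.append((checked, checked <= acceptable_us))
        switch_allowance += 1                 # one more microsecond per switch
    return results


if __name__ == "__main__":
    links = [32] * 5                          # assumption: five links at 32us each
    budget = 64                               # the I225's L1 acceptable latency
    for checked, ok in check_l1(links, budget):
        print(checked, "pass" if ok else "fail")
    # Hypothetical summing model, which is NOT what the kernel does:
    print("sum of links:", sum(links), "vs budget", budget)
```

The per-link model tops out at 36µs and passes; a straight sum of the same five links would be 160µs and would not.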

(The I225 does advertise ASPMOptComp+. That is the ASPM Optionality Compliance bit, "I function correctly with ASPM on or off", not a latency-budget bypass. aspm.c never reads it to enable L1 over budget, and on this path it did not need to.)

That is the actual trap. The budget the kernel checks is a model, and on this topology the model is too optimistic. The physical L1 exit does not run across one 32µs link; it runs across five serialized hops with a retimer in the path, re-establishing each link in turn. The 1µs-per-switch allowance is a heuristic from an era of shallower topologies, and it under-counts what a deep retimer'd path costs to wake. Nothing in the modeled budget, kernel or spec, would have flagged this configuration. That is exactly why L1 was on. And because the failure is in the exit handshake rather than in any transaction, AER never sees it.

I have not put the L1 exit on an oscilloscope, so I cannot point at the microsecond where the silicon gives up. What I can say is concrete and checkable: the path's modeled exit latency clears the endpoint's budget, so L1 is enabled by normal policy; the physical exit on a five-hop retimer'd path is plausibly far larger than the model assumes; the failure mode is "first MMIO after a long quiet period returns all ones"; and AER is silent throughout. That is consistent with an L1-exit handshake that does not complete on this topology. It is not a proof.

What the driver was doing about it

$ git grep -n pci_disable_link_state v6.18 -- drivers/net/ethernet/intel/igc/igc_main.c
7161:        pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
7548:        pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
7676:        pci_disable_link_state_locked(pdev, PCIE_LINK_STATE_L1_2);

All three sites are guarded by igc_is_device_id_i226(hw). They only fire for the newer I226. They only ever disable PCIE_LINK_STATE_L1_2, the deepest substate. Plain L1, the level my hang happens at, is never disabled for any device ID.

The relevant commit is 0325143b59c6 igc: disable L1.2 PCI-E link substate to avoid performance issue, landed in v6.16 (2025-07-01), with this comment:

I226 devices advertise support for the PCI-E link L1.2 substate. However, due to a hardware limitation, the exit latency from this low-power state is longer than the packet buffer can tolerate under high traffic conditions. This can lead to packet loss and degraded performance.

That fix is for an I226 packet-loss problem, not an I225 catastrophic-hang problem. The I225 is left alone.

For comparison, r8169_main.c has this in its probe path, rtl_init_one():

/* Disable ASPM L1 as that cause random device stop working
 * problems as well as full system hangs for some PCIe devices users.
 */
if (rtl_aspm_is_safe(tp)) {
    dev_info(&pdev->dev, "System vendor flags ASPM as safe\n");
    rc = 0;
} else {
    rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1);
}

That comment matches my symptoms almost exactly: "random device stop working" and "full system hangs". r8169 disables ASPM L1 by default at probe and only re-enables when the system vendor has set a specific MAC OCP register flag through rtl_aspm_is_safe() to certify the board.

drivers/bluetooth/hci_bcm4377.c is a second precedent, though it disables ASPM more broadly. It turns off both L0s and L1 for a documented hardware erratum, and is emphatic enough to clear the LnkCtl bits directly in case pci_disable_link_state is refused:

static void bcm4377_disable_aspm(struct bcm4377_data *bcm4377)
{
    pci_disable_link_state(bcm4377->pdev,
                           PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1);
    /* ... We must *always* disable ASPM for this device due to
     * hardware errata though. */
    pcie_capability_clear_word(bcm4377->pdev, PCI_EXP_LNKCTL, ...);
}

The distinction worth keeping is which knob each driver reaches for. Two disable PCIE_LINK_STATE_L1 specifically, which is the exact knob my workaround uses: r8169, and dwmac-motorcomm (the stmmac glue for Motorcomm's YT6801, whose comment reads "let's disable L1 state unconditionally for safety"). bcm4377 disables ASPM more broadly, L0s and L1 together, for its documented erratum. The rest include L1 only as part of a wider blanket policy: hpsa, mpt3sas, aacraid, and jme use L0S | L1 | CLKPM; alcor_pci uses L0S | L1. Those blanket cases are weaker analogies, because they read as "we never want ASPM on this class of card", not "this silicon's L1 exit is broken." The cleanest precedents, an Ethernet driver disabling L1 alone because L1 specifically misbehaves, are r8169 and dwmac-motorcomm.

What igc has for I225 in 2025 is nothing.

Trying to reproduce on demand

A passive 36-hour soak is a poor diagnostic instrument. I wrote a script that toggled the link in and out of L1 as aggressively as possible:

  • Short idle 1 to 5 seconds, then ethtool -S (which reads IGC_RQDPC(0), exactly the register that failed)
  • Long idle 60 to 300 seconds, then ethtool -S
  • Short idle, then 50 register reads back-to-back

The script ran for an hour. 151 iterations. ASPM L1 was confirmed enabled the whole time. AER stayed at zero. No detach.
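For anyone who wants to run a similar soak, here is a rough reconstruction of the loop's shape. It is a sketch, not the script I ran: the function names are mine, the back-to-back reads are modelled as repeated ethtool -S calls, and a non-zero exit from ethtool is treated only as a hint, since dmesg remains the authoritative signal for a detach.

```python
import random
import subprocess
import time


def read_stats(iface: str) -> int:
    """Run `ethtool -S <iface>` once and return its exit status."""
    proc = subprocess.run(["ethtool", "-S", iface], capture_output=True, text=True)
    return proc.returncode


def provoke(iface, iterations, idle_range=(1.0, 5.0), reads=1,
            reader=read_stats, sleep=time.sleep, rng=random.uniform):
    """Idle for a random time in idle_range, then read stats `reads` times.

    Returns the indices of iterations in which any read failed.
    reader, sleep and rng are injectable so the loop can be dry-run.
    """
    failures = []
    for i in range(iterations):
        sleep(rng(*idle_range))               # give the link time to drop into L1
        for _ in range(reads):
            if reader(iface) != 0:
                failures.append(i)
                break
    return failures


if __name__ == "__main__":
    # The three patterns from the list above, with small iteration counts.
    print(provoke("eno1", 10, idle_range=(1, 5)))
    print(provoke("eno1", 2, idle_range=(60, 300)))
    print(provoke("eno1", 10, idle_range=(1, 5), reads=50))
```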

The natural failure happened to a process that polled stats roughly every two seconds for 36 hours, which works out to about 65,000 register-read events before the hang. My 151 iterations are roughly 1/430 of that exposure. The math suggests the failure rate per L1 entry/exit cycle is very low. The bug exists, but it does not yield to one hour of aggressive cycling.
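The exposure figures are plain arithmetic, reproduced here so the rounding is visible. The two-second polling interval is my estimate of btop's refresh rate, not a measured value.

```python
def poll_events(hours: float, interval_s: float) -> int:
    """Number of stats polls in `hours` at one poll every `interval_s` seconds."""
    return int(hours * 3600 / interval_s)


if __name__ == "__main__":
    natural = poll_events(36, 2)       # polls before the organic detach
    provoked = 151                     # iterations of the one-hour script
    print(natural)                     # 64800, rounded to 65,000 in the text
    print(round(natural / provoked))   # how many times larger the natural exposure was
```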

The workaround

The candidate workaround is one sysfs file:

# /etc/tmpfiles.d/igc-aspm-disable.conf
w /sys/bus/pci/devices/0000:0a:00.0/link/l1_aspm  -  -  -  -  0

l1_aspm is the gate for plain L1. Disabling it implicitly disables the substates too (no L1, no L1.1, no L1.2), which lines up with what my lspci -vvv already showed before the workaround: L1SubCtl1 had ASPM_L1.1- ASPM_L1.2- even with plain L1 enabled. The other knobs (l1_1_aspm, l1_1_pcipm) are redundant on this hardware.

Two things to confirm after applying it:

  1. lspci -vvv -s 0a:00.0 should now show LnkCtl: ASPM Disabled where it previously showed ASPM L1 Enabled. Firmware can retain ASPM control and silently ignore the sysfs write; verify that the bit actually flipped.
  2. systemd-tmpfiles-setup.service runs early in boot, but it is not the first thing the kernel does after probing the device. The exposure window between PCI probe and tmpfiles-setup is on the order of a few seconds, not zero. The bug's per-trigger rate is low enough that a few seconds does not matter in practice, but if you want a tighter binding, use a udev rule keyed on the device:
# /etc/udev/rules.d/99-igc-aspm.rules
SUBSYSTEM=="pci", ACTION=="add", KERNEL=="0000:0a:00.0", ATTR{link/l1_aspm}="0"

The udev rule runs the moment the kernel adds the device, before any other userspace touches it.
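The first check can be scripted so it runs after every boot. A minimal sketch, assuming lspci prints the link control line in the form "LnkCtl: ASPM <state>; ..." and that the l1_aspm attribute reads back as 0 or 1; the helper names are mine, and lspci -vvv generally needs root to show the capability block.

```python
import re
import subprocess
from pathlib import Path


def aspm_state(lspci_text: str):
    """Extract the ASPM state from the LnkCtl line of `lspci -vvv` output."""
    m = re.search(r"LnkCtl:\s*ASPM ([^;]+);", lspci_text)
    return m.group(1).strip() if m else None


def l1_aspm_sysfs(bdf: str) -> str:
    """Read back the sysfs knob the workaround writes."""
    return Path(f"/sys/bus/pci/devices/{bdf}/link/l1_aspm").read_text().strip()


def verify(bdf: str = "0000:0a:00.0") -> bool:
    """True only if both the sysfs knob and the LnkCtl register agree L1 is off."""
    out = subprocess.run(["lspci", "-vvv", "-s", bdf],
                         capture_output=True, text=True).stdout
    return l1_aspm_sysfs(bdf) == "0" and aspm_state(out) == "Disabled"


if __name__ == "__main__":
    print("L1 disabled:", verify())
```

Checking both values matters because the sysfs file can accept the write while the register stays unchanged.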

What is still uncertain

The reproduction work has been ongoing since the original detach. With ASPM L1 enabled and the NIC linked at 1Gbps to a known-good switch port, the provoke script has run for the equivalent of 17 days across multiple boots, totalling about 23,000 trigger events. Zero detaches.

Naively that sounds like "1Gbps is safe". Statistically it is not even close. The natural rate I measured was 1 event in ~65,000 trigger cycles. Under that same rate, the expected number of events in 23,000 cycles is about 0.35, and the probability of seeing zero is e^(-0.35) ≈ 0.70. In other words: if 1Gbps and 2.5Gbps had identical per-cycle failure rates, I would still see zero failures in 23,000 cycles about 70% of the time. The 1G run is underpowered to detect a difference, not evidence of one.
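The 70% figure comes from treating detaches as a Poisson process with a constant per-cycle rate, which is itself an assumption: one observed event is a very thin basis for a rate. Under that assumption the calculation is:

```python
import math


def p_zero_events(cycles: float, rate_per_cycle: float) -> float:
    """Poisson probability of observing no events in `cycles` trials."""
    return math.exp(-cycles * rate_per_cycle)


if __name__ == "__main__":
    rate = 1 / 65_000                  # one detach in ~65,000 polls (single observation)
    expected = 23_000 * rate
    print(round(expected, 2))                         # expected events in the 1G run
    print(round(p_zero_events(23_000, rate), 2))      # chance of seeing none anyway
```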

To actually distinguish the two link speeds, I either need a second natural-or-provoked detach at 2.5Gbps to tighten the rate estimate, or I need enough 1Gbps cycles that zero becomes improbable under the 2.5Gbps rate. That second threshold is around 200,000 cycles: the point at which seeing zero events would have less than a 5% chance if the 1-in-65,000 rate applied.
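The cycle threshold follows from the same Poisson assumption, solved for the number of cycles at which a zero count has only a 5% chance. The 5% cutoff is a conventional choice, not something the data dictates.

```python
import math


def cycles_for_rare_zero(rate_per_cycle: float, alpha: float = 0.05) -> int:
    """Smallest cycle count n with exp(-n * rate) <= alpha."""
    return math.ceil(-math.log(alpha) / rate_per_cycle)


if __name__ == "__main__":
    n = cycles_for_rare_zero(1 / 65_000)
    print(n)    # a little under 195,000, rounded up to 200,000 in the text
```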

The clean causal test for ASPM is also still pending, and the honest evidence is weaker than I would like. Every run since the original detach:

  • Detach at 2.5Gbps, organic load, ASPM L1 enabled. One event. This is the only time the NIC has run at 2.5Gbps, and it failed.
  • No detach at 1Gbps, ASPM L1 enabled, ~23,000 provoke cycles.
  • No detach at 1Gbps, ASPM L1 disabled (the candidate workaround), weeks of uptime.

Read those together and the problem is stark. Both stable arms are at 1Gbps, and at 1Gbps the NIC does not detach whether L1 is on or off. So the 1G data cannot tell me whether disabling L1 does anything; 1G simply never fails. And the only 2.5Gbps run I have detached. I have never observed this card running stably at its rated 2.5Gbps under any configuration, with or without the workaround.

That makes the current state honest but unsatisfying: the machine has a working wired link only because it is running a 2.5G NIC at 1G. So far the "workaround" is indistinguishable from "I stopped using the speed that fails." Disabling L1 is well-motivated by the latency-model argument above and is the right thing to test, but I have not earned the claim that it fixes anything. The test still owed: reproduce at 2.5Gbps with L1 enabled, then equal exposure at 2.5Gbps with L1 disabled. Until that runs, L1-disable is a hypothesis with a plausible mechanism, not a demonstrated fix.

What I can say plainly:

  • A candidate workaround exists, is small, and lives entirely in userspace. Whether it actually prevents the detach at 2.5Gbps is not yet shown; every clean run so far has been at 1G, where the bug does not bite regardless.
  • The driver has no plain-L1 defensive code for I225 device IDs. The pattern of disabling PCIE_LINK_STATE_L1 specifically for a known defect already exists in r8169 and dwmac-motorcomm; bcm4377 disables L0s and L1 together for its erratum.
  • The failure leaves no AER trace. Anyone debugging similar symptoms by grepping for PCIe errors will find nothing.
  • The kernel enabled L1 by ordinary policy: its modeled exit latency for this path (per-link exit + 1µs/switch, not a sum) clears the I225's 64µs budget. The defect is that the model under-counts a deep retimer'd path's real exit latency. The budget being satisfied is why nothing stopped L1, not evidence the path is safe.
  • igc_rd32's all-ones detector is the only line of defence the driver has. The WARN it prints names the failing register but does not name the cause. The cause is two lines earlier in dmesg, in a single line saying "PCIe link lost", and even that line is the consequence rather than the trigger.

The blacklist line in /etc/modprobe.d/ is gone. What replaced it is not a fix. The card carries traffic because it is pinned to 1G, the one speed that has never failed; at its rated 2.5G I have a single run and a single detach. The real question, whether an I225-V on this board can run reliably at 2.5G with ASPM L1 disabled, is still open. I have a plausible mechanism, a one-line lever to test it, and not yet the evidence to claim it works. That test, and the second 2.5G detach that would anchor it, is the next post.

Hardware: ASUS ROG STRIX X670E-E, onboard I225-V rev 03 (B3 stepping), subsystem 1043:87d2, BIOS 3402. Kernel: Linux 6.18.24-1-lts and forward. NVM/firmware: 1082:8770. Path: AMD root port → internal PCIe switch (multi-hop, retimer in the segment to 0a:00.0) → I225. The dmesg block is from the running 6.18.24-1-lts kernel (hence igc_main.c:7009). Source quotes and the grep line numbers are against the v6.18 release tag; absolute line numbers drift by a handful across the stable series and mainline, so the function names and commit hashes are the durable references. The pcie_aspm_check_latency() excerpt is byte-identical in v6.18 and current mainline (v7.1).
