When a Healthy ONU Drops 32,000 Frames to 1

A firmware update broke my fiber WAN. The device that broke it reported itself perfectly healthy on every check I could think of. Here is how I found the fault, the seven wrong turns I took first, and how far the evidence actually goes.

The setup

The uplink is an HSGQ/ODI M110 GPON SFP stick, a Realtek RTL9602C SoC inside a transceiver. Firmware V1.0-220923 works. Firmware V1.1.8-240408 reproducibly kills WAN. Same fiber, same OLT, same hardware. Only the firmware string differs.

The HSGQ/ODI M110 GPON SFP stick.

The cruel part is that the broken firmware looks fine. The ONU reaches O5, the GPON operational state (G.984.3): registered with the OLT, ranged, all alarm flags clear, GEM port mappings present. Every surface check is green. No service traffic reaches the host.

One symptom sent me the wrong way early. Link-local control frames, LLDP (multicast) and MNDP (broadcast), still showed up in the router's traffic sniffer while service unicast did not. I read that as a host-side SerDes problem. That instinct was aimed at the right part of the stick and the wrong layer, which I only understood much later. A dead SerDes link carries nothing; these frames crossed, so the physical link was up. What the clue actually rules out is a dead PHY, not a fault in the code that drives the egress path. Hold that thought; it comes back at the end. (I also did not pin the source of those frames at the time. Nearest-bridge LLDP comes from the closest L2 neighbor, not from several hops upstream, and I should have read the chassis ID instead of assuming.)

Seven theories, all dead

Before I had any discipline, I did what you do: I diffed the two firmwares and built a story around each difference. Every theory rested on a real diff, a real config field, or a real entry in the GPON management spec. Every one was wrong.

Missing shared-library files in the new firmware. They were present in both.
A MAC-key verification step gating the service push. It passes on the broken firmware.
An OMCI alarm-notify cascade acting as a breaker. It was a transient from one bad activation, gone once O5 is reached.
An init-script flash-key rename. Real diff, but a self-healing config step writes the new key on first boot, so it never fires.
VLAN-tag-operation drift in rebuilt code. The OLT-pushed config was byte-identical between firmwares.
The advertised firmware-version string being used by the OLT as a service discriminator. This one I tested: I spoofed a fake version on the working firmware and watched WAN keep working. Falsified.
A new packet-redirect helper binary eating traffic. Also tested: I froze the process for five minutes (kill -STOP, confirmed state T) and watched the drop ratio not move. Falsified for active per-packet handling, which is what STOP halts. It would not falsify a helper that installs a persistent drop rule and exits the hot path, since the rule survives the freeze. For the stated hypothesis, though, it was the right test.

The last two are the only ones worth anything, because each made a prediction and put it at risk. If the OLT keyed off the version string, a spoofed version on the working firmware should have broken it. It did not. If the helper binary ate the traffic, freezing it should have moved the drop ratio. It did not. That is the difference between an experiment and a diff: an experiment can come back no. The other five are the trap. That is what diff-driven debugging feels like from the inside: each lead is genuine, and none of them is the answer. A version bump changes dozens of things, every change is a plausible lead, and the diff is bottomless. You run out of patience before you run out of stories.

The ladder

The fix was to stop asking "what changed?" and start asking "where do the counters disagree?" The set of changes between two firmwares is huge. The set of subsystems that can drop frames between two specific counters is small.

That gives a four-rung ladder. Walk it top to bottom. Each rung subsumes the territory of every hypothesis below it, so you do not chase any specific theory until all four are run.

Rung 1. Does the ONU reach O5? If not, the fault is at ranging, authentication, optical, or host-side SerDes. Mine reached O5, so all of that is dead.

Rung 2. Is the downstream GEM "Non Idle" counter climbing? If yes, the OLT is pushing traffic and frames are reaching the GEM layer, a sublayer within the GPON transmission-convergence layer (G.984.3). This kills every "the OLT is refusing to serve us" and "vendor identity mismatch" theory at once. Mine was climbing.

Rung 3. Compare GEM Non-Idle to the Ethernet "Total Unicast" counter. These two bracket the on-stick datapath: between them sit de-encapsulation, bridging, VLAN operations, and the user-port MAC. They are not a clean before/after pair, and it matters to say why: GEM Non-Idle counts every non-idle downstream GEM frame, all cast types plus the OMCI management channel, while Total Unicast counts unicast egress only. The diag interface does not split GEM by cast, so a clean unicast-in, unicast-out comparison was never on offer. Take it cast-agnostic.

On the broken firmware, over a fifteen-minute window, GEM Non-Idle counted 32,095 frames. OMCI cannot account for them, a few messages a second against tens of thousands of frames, so these are service frames that reached GEM. Across both Ethernet egress counters, exactly one surfaced: unicast 1, multicast 0. Whatever the cast mix arriving, essentially none of it egressed. That also rules out benign accounting, the worry that the GEM count is mostly multicast that simply never touches the unicast counter: multicast forwarded through normally would have ticked the multicast counter, not left it at zero. That single surfaced frame matters too: it proves the egress counters are live and wired on the broken build, so near-zero is real loss, not a counter that stopped incrementing.

A 32,000-to-1 drop localises the fault to the stick's internal datapath. It does not by itself acquit the OMCI config, since a bad OLT-pushed VLAN or filter rule could also drop frames after the GEM layer. Rung 4 settles that.

Rung 4. Diff the post-provisioning Managed Entity state between the two firmwares. The MEs are the OMCI-managed config the OLT pushes into the stick (G.988). If identical, the bug is not in what the OLT pushed or how the stick received it. It is in what compiled code does with the frames afterward. The ME state was byte-identical. That, with Rung 3, kills the OMCI-config and OMCI-identity theories together.
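The Rung 4 comparison is just a diff of two dumps taken after provisioning. A minimal sketch, assuming the ME state has already been captured as text from each firmware; the dump format, the function name, and the sample strings are placeholders, not the stick's real output:

```python
import difflib

def me_state_diff(good_dump: str, bad_dump: str) -> list:
    """Return unified-diff lines between two text dumps; empty list if identical."""
    return list(difflib.unified_diff(
        good_dump.splitlines(),
        bad_dump.splitlines(),
        fromfile="V1.0-220923",
        tofile="V1.1.8-240408",
        lineterm="",
    ))

# Placeholder dumps: an empty result means the OLT-pushed state matches.
identical = me_state_diff("entry A\nentry B", "entry A\nentry B")
print("ME state identical" if not identical else "\n".join(identical))
```

An empty result is the interesting outcome here: it moves the fault out of the pushed config and into what the code does with it.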

The decision took seconds once the counters were in hand. The only wait was letting the broken-firmware counters accumulate. Run the ladder first and you skip all seven dead ends.
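The four rungs can be written down as a short decision function. This is a sketch of the reasoning above, not a tool I ran: the field names are hypothetical, and the 0.5 egress threshold is an arbitrary assumption standing in for "most frames made it out".

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """Readings for one firmware over one window; all names are hypothetical."""
    reached_o5: bool
    gem_non_idle: int            # downstream GEM Non Idle frames in the window
    eth_unicast: int             # Ethernet Total Unicast in the same window
    eth_multicast: int           # Ethernet Total Multicast in the same window
    me_state_matches_good: bool  # Rung 4: ME dump identical to working firmware

def walk_ladder(obs: Observation, min_egress_fraction: float = 0.5) -> str:
    """Walk the rungs top to bottom and return where the fault is localised."""
    if not obs.reached_o5:
        return "rung 1: ranging, authentication, optical, or host-side SerDes"
    if obs.gem_non_idle == 0:
        return "rung 2: OLT is not delivering traffic to the GEM layer"
    egress = obs.eth_unicast + obs.eth_multicast  # cast-agnostic comparison
    if egress / obs.gem_non_idle >= min_egress_fraction:
        return "rung 3: no large drop on the stick; look elsewhere"
    if not obs.me_state_matches_good:
        return "rung 4: OMCI-pushed config differs between firmwares"
    return "below OMCI: compiled datapath code on the stick"

# The broken-firmware numbers from the fifteen-minute window.
print(walk_ladder(Observation(True, 32_095, 1, 0, True)))
```

The order of the checks is the point: each early return closes off a whole family of theories before any single one gets chased.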

The point

Counter-driven hypotheses are bounded by where the counters disagree. Diff-driven hypotheses are bounded by nothing. When two counters that bracket a datapath report 32,095 and 1, the question is not what changed in the firmware. It is what can physically produce that ratio, and that list is short enough to finish.

The GPON details do not transfer, but the move does. For any system where data crosses layers, find the counters at each boundary, order them outside-in, and ask the counters where the bug is before you ask the diff.

Receipts and how far the trail goes

The rest of this is for anyone who wants the proof rather than the lesson.

Rung 1, that the ONU reached O5 on the broken firmware, from the stick's diag shell:

# diag gpon get onu-state
ONU state: Operation State(O5)

The Rung 2 and Rung 3 counters come from the same shell (diag gpon show counter global ds-gem and ... ds-eth), not from OMCI performance-monitoring MEs. On this firmware those PM MEs are instantiated but empty, so the diag interface is the real source. A second capture, from a different session, showed the same failure:

# diag gpon show counter global ds-gem
     GPON ONU MAC Device Counter: DS GEM
D/S GEM Idle    : 3799846929
D/S GEM Non Idle: 67238
# diag gpon show counter global ds-eth
     GPON ONU MAC Device Counter: DS ETH
Total Unicast   : 156
Total Multicast : 0

67,238 non-idle GEM frames in, 156 frames out on Ethernet. The drop is near-total in both captures but not absolute: 1 frame got through in the fifteen-minute window, 156 in this one. A residue like that is what you would expect from CPU-path traffic (control frames, ARP, anything the stick handles in software rather than forwarding in hardware) leaking past a hardware-level break, though I did not classify the surviving frames to confirm it.
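If you want the ratio without reading it off by eye, the capture above parses easily. A minimal sketch, assuming the output keeps the "name : integer" layout shown; the parser and its function name are mine, not part of the diag shell:

```python
import re

def parse_counters(text: str) -> dict:
    """Map 'name : integer' lines to {name: value}; other lines are skipped."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters

# The second capture, pasted verbatim.
capture = """\
# diag gpon show counter global ds-gem
     GPON ONU MAC Device Counter: DS GEM
D/S GEM Idle    : 3799846929
D/S GEM Non Idle: 67238
# diag gpon show counter global ds-eth
     GPON ONU MAC Device Counter: DS ETH
Total Unicast   : 156
Total Multicast : 0
"""

c = parse_counters(capture)
egress = c["Total Unicast"] + c["Total Multicast"]  # cast-agnostic egress
fraction = egress / c["D/S GEM Non Idle"]
print(f"{egress} of {c['D/S GEM Non Idle']} frames egressed ({fraction:.2%})")
```

The header lines contain a colon but no trailing integer, so the pattern skips them along with the shell prompts.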

The same capture reads Total Multicast : 0 even though link-local multicast reached the router earlier. I cannot fully reconcile those with what I measured, and I will not invent a mechanism to paper it over. Two candidates, neither confirmed: the multicast may never have crossed the stick's downstream Ethernet MAC at all (I never pinned its source, and a nearest-neighbor LLDP or an MNDP broadcast can reach the router by other paths), or CPU-injected frames may egress without incrementing that counter on this chip, which is SoC-dependent and which I did not verify for the RTL9602C. Either way it does not touch the unicast result.

The broken-firmware downstream rate is low, about 36 frames a second over the fifteen-minute window against 645 a second of egress on the working build. That is expected: the host behind the stick was isolated and initiated nothing, so the only downstream traffic was unsolicited background. It also underlines that the load-bearing claim is the near-total ratio, not the absolute count. The working-firmware figure (232,260 unicast in six minutes) is only a baseline showing the counter is normally large and nonzero, not a matched comparison.
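The two rates are plain division. A quick check, assuming the windows were exactly fifteen and six minutes (they are round figures in the text, so treat the results as approximate):

```python
# Frame counts are from the captures; window lengths are the stated round figures.
broken_gem_frames = 32_095
broken_window_s = 15 * 60
working_unicast_frames = 232_260
working_window_s = 6 * 60

broken_rate = broken_gem_frames / broken_window_s
working_rate = working_unicast_frames / working_window_s

print(f"broken firmware, GEM ingress:  {broken_rate:.1f} frames/s")
print(f"working firmware, unicast out: {working_rate:.1f} frames/s")
```

The two figures measure different sides of the datapath under different load, which is why the text treats the working number as a baseline only.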

Rung 5: inside the stick

Rung 4 puts the fault in compiled code below OMCI. You can get a shell on the stick and narrow further, though this is where confirmed evidence starts handing off to inference, and the handoff happens earlier than I would like.

Confirmed:

The loadable kernel modules are effectively identical. I disassembled both with objdump -d and diffed the instruction streams: pf_rtk.ko, the data-plane bridge module, has zero differing instructions; omcidrv.ko differs by a single byte in .rodata (one character, 0x67 to 0x71), consistent with a build-stamp or version character and not code. (Raw .ko bytes differ in vermagic and relocations regardless, which is why I compared disassembly, not bytes.) Neither module is the bug.
The kernel image content differs. Both decompressed images are the same size and a raw byte diff is large, but that figure is close to meaningless: inserting code shifts every later byte, so a positional diff of two linked images is dominated by relocation, not real change. The load-bearing comparison is a strings set-diff, which surfaces the new V1.1.8 code: changes in lan_sds_main.c (the LAN-side SerDes driver), an SFP-application IPC channel, an EEPROM mirror debug interface, and a function trtk_gponapp_omci_mirror_set, all under drivers/net/rtl86900/sdk/src/module/lan_sds/.
V1.1.8 runs a kernel thread the working firmware does not:

383 admin   0 SW<  [sfp_main]

In /proc/rtl8686gmac/dev_port_mapping, the rx and tx forwarding port masks are byte-identical between firmwares. The one delta is in the carrier mapping: Port0's link-state notification moves from the PON interface to the Ethernet interface.

  V1.0:    Port0 => ifname:pon0.2 , dev:pon0
  V1.1.8:  Port0 => ifname:eth0   , dev:eth0

The Linux-visible config (brctl, ebtables, ip link, /proc/net/*) is identical between firmwares. The forwarded datapath does not run in the Linux stack at all: pon0 shows zero packets in /proc/net/dev while the fabric's diag counters show tens of thousands of downstream frames. The only traffic Linux sees is host-directed management to the stick's own address, not the forwarded service path. Forwarding happens in the switch fabric, in silicon, below where tcpdump or netdev counters can see it. That fabric-versus-stack split is why the fault hides from every host-side tool.
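The strings set-diff used on the two kernel images can be sketched in a few lines. This is a stand-in for running a strings-style extraction on each image and comparing the results, not the exact commands I used; the function names, the six-character minimum, and the file names in the usage comment are assumptions.

```python
import re

def printable_strings(data: bytes, min_len: int = 6) -> set:
    """Return the set of printable-ASCII runs of at least min_len bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return {m.decode("ascii") for m in re.findall(pattern, data)}

def strings_only_in_new(old_image: bytes, new_image: bytes, min_len: int = 6) -> list:
    """Strings present in the new image and absent from the old, sorted."""
    return sorted(printable_strings(new_image, min_len)
                  - printable_strings(old_image, min_len))

# Tiny synthetic images: the shared string sits at different offsets in each,
# which a positional byte diff would flag and a set-diff ignores.
old = b"\x00\x01drivers/net/foo.c\x00abc\x00"
new = b"\x00drivers/net/foo.c\x00\x02lan_sds_main.c\x00"
print(strings_only_in_new(old, new))

# Hypothetical usage on real decompressed images:
#   old = open("vmlinux-v1.0.bin", "rb").read()
#   new = open("vmlinux-v1.1.8.bin", "rb").read()
```

Comparing sets rather than positions is what makes this robust to inserted code shifting every later byte.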

Where I would read first:

The new code splits into two kinds. The EEPROM-mirror interface and trtk_gponapp_omci_mirror_set read as host-presentation and diagnostics plumbing, the stick exposing its EEPROM and OMCI state to the host, plausibly unrelated to a forwarding drop. The third piece, lan_sds_main.c, expands by its name to the LAN-side SerDes driver, though I did not read the source to confirm the module is what the name implies. It is the only new piece that sounds like it sits near the egress path, so it is where I would open the disassembler first.

But the name, if it is right, argues against making it the lead. A SerDes carries bits; it does not distinguish cast. A fault confined to a SerDes driver gives you a dead or degraded link, which the link being up already excludes, or a corrupted mapping that the forwarding engine then misuses. Deciding which frames forward and which get trapped is done above the SerDes, in the forwarding and lookup logic. So whichever way I push it, the likelier seat is the forwarding code, not the SerDes. The forwarding tables I can see are byte-identical between firmwares, which puts that fault in compiled fabric programming below the visible config, exactly where Rung 4 left it. The new code tells me where to start reading, not where the bug is.

It does reconnect to the symptom I misread at the start, with the layer corrected. Multicast crossing the link killed the physical SerDes theory: a dead link carries nothing. My first instinct pointed at roughly the right region of the stick and the wrong layer. The corrected version points one layer up, at the forwarding logic, still compiled and still below what these dumps show.

What I still did not establish:

I did not read the driver, so I have not confirmed that lan_sds is where the drop happens rather than merely where code changed, nor even that the module is the SerDes driver its name suggests.
The cast-selective picture itself is not pinned. "Unicast drops while multicast passes the same port" would be a clean forwarding signature, but with the Ethernet multicast counter at zero I cannot show multicast passed; the survivors either took another route or were not counted there.
The Port0 carrier change points the same LAN-ward direction (its notification moves from the PON side to the Ethernet side). It is link-state signalling, not forwarding (the forwarding masks were identical), so I am not blaming it, only noting it rhymes.
I excluded the loadable modules by disassembly. Within the kernel I did not exclude my way to a culprit; I selected the area to read first by correlation (new code, the [sfp_main] thread), and stopped there.

So the trail ends at a layer, not a line number: the fault is below OMCI, in compiled fabric forwarding code, and I confirmed which large pieces it is not. I have a new-code area worth reading first and a reasoned argument that the seat is one layer above it. I did not read the driver to settle that, and from outside the binary I cannot; that is a disassembly session, not more reasoning. The practical fix is to stay on V1.0-220923. The transferable result is the ladder, and the discipline of marking exactly where the evidence stops and the inference starts, even when that line falls earlier than the story would like.
