pam_authnft: what 'session' means to a Linux firewall
On Linux, the hard part of building a per-session firewall is not the firewall. It's deciding what "session" means in a way the kernel can match against later, without trusting anything in userspace to have told the truth.
pam_authnft is the smallest concrete experiment I could build to push at that question. It's a PAM session module that binds nftables rules to an authenticated session using the session's cgroupv2 inode as the identity. SSH in, your firewall rules appear; log out, they're gone. No setuid helper, no dedicated shell, no kernel patches.
The framing I keep coming back to has four verbs. To give a session, a
workload, or any other unit of activity a kernel-visible identity, you
have to
- create it somewhere
- store it in something durable
- transport it through whatever subsystem boundaries the packets cross
- and verify it where the policy decision lives
Most of the interesting failure
modes come from one of those four steps being delegated to a structure
that was not designed for it. skb->mark is the store
step quietly failing across IPsec/XFRM transform boundaries, where
marks set on the outer encrypted packet leak into the decrypted inner
one. Source-IP-based identity is the transport step failing the moment a packet crosses a NAT or masquerade boundary. Classical identd (RFC 1413) is the verify
step trusting an answer the entity being verified is the one supplying.
pam_authnft picks, for each verb, a kernel structure that was designed
to do that thing rather than one that ended up doing it.
Under the hood
OpenBSD has authpf: log in, pf gains a per-user anchor;
log out, the anchor goes away. The trick on BSD is that authpf is the
user's login shell, so the lifecycle of the rules is the lifecycle of
the shell process. Nothing stops you from doing the same thing on Linux,
but a login-shell wrapper only sees sessions that actually run a shell,
and a lot of authenticated sessions (sftp, scp, rsync over SSH, an
OpenVPN connection that terminates in a routing change) never do. PAM
session hooks fire on every authenticated session managed by PAM. That
is the layer pam_authnft sits at.
The pieces have been part of a stock Linux system for years: systemd transient .scope units (so every session lands in its own named cgroup), cgroupv2 (each cgroup directory gets a stable inode), and nftables meta cgroup (so a rule can match against the cgroup of a socket's originating process). On pam_open_session the module asks systemd over D-Bus to create a transient scope, calls stat(2) on the cgroup directory to read its inode, and inserts { inode . src_ip } into an nftables set declared with typeof meta cgroup . ip saddr. A root-owned fragment under /etc/authnft/users/<name>, validated for ownership and mode before loading, supplies the rules that reference the set. On pam_close_session the element is deleted.
The username is rejected if it contains path traversal or shell metacharacters. The remote host is rejected if it isn't a parseable IP, which means console logins and su sessions without a network peer pass through unmodified. Because meta cgroup needs a local socket, this covers traffic the box originates or terminates, not traffic it forwards. The PAM process runs under a seccomp-BPF allowlist derived from a full strace of an open/close cycle.
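The input checks described above can be sketched in a few lines. This is an illustration in Python, not the module's C. The post only says the username is rejected for path traversal or shell metacharacters, a non-IP remote host means pass-through, and the fragment is validated for ownership and mode. The allowlist pattern, the mode policy, and all three function names are assumptions made up for this sketch.

```python
import ipaddress
import os
import re
import stat

# Assumed allowlist: a conservative login-name pattern. Allowlisting sidesteps
# having to enumerate every traversal sequence or shell metacharacter.
_USERNAME_RE = re.compile(r"[a-z_][a-z0-9_-]{0,31}")


def is_safe_username(name):
    """Return True only if the whole name matches the assumed allowlist."""
    return _USERNAME_RE.fullmatch(name) is not None


def parse_peer_ip(rhost):
    """Return the remote host as an IP address object, or None.

    None stands for "no network peer" (console login, su): the session is
    passed through unmodified rather than given firewall state.
    """
    if not rhost:
        return None
    try:
        return ipaddress.ip_address(rhost)
    except ValueError:
        return None


def fragment_looks_trustworthy(path, owner_uid=0):
    """Assumed policy: regular file, owned by owner_uid, not group/other writable."""
    st = os.lstat(path)  # lstat, so a symlink is never mistaken for a regular file
    return (
        stat.S_ISREG(st.st_mode)
        and st.st_uid == owner_uid
        and not (st.st_mode & (stat.S_IWGRP | stat.S_IWOTH))
    )
```

Note that ipaddress.ip_address also accepts IPv6 literals, while the set described above is keyed on ip saddr. A stricter sketch would check for an IPv4Address before building the set element.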
Session state is inspectable with plain nft list table inet authnft: no bpftool,
no BPF program introspection, no agent on a socket. The other obvious
way to do per-cgroup filtering on Linux is cgroup-BPF; pam_authnft picks
the older substrate so an admin debugging at 3am needs to know nftables
and nothing else.
Why the cgroup inode is the interesting choice
The intuition is the one you have when you take a phone call. Once you have established who is on the line, you do not keep asking. The call carries the identity for its lifetime, and both ends behave as if it is trusted until somebody hangs up.
This is not a new intuition on the firewall side. Stateful packet
inspection, commercialised by Check Point's FireWall-1 in the mid-1990s,
was the move from treating each packet as an independent decision
to treating it as a member of a flow. Linux's conntrack and OpenBSD's pf
state tables are the direct kernel-side descendants of that idea.
pam_authnft has the same shape, one rung coarser: where conntrack tracks
state per flow, pam_authnft tags state per session, and the cgroup
inode is what carries the tag. Every connection from a process inside
the session cgroup inherits it, and the two compose without arguing.
The interesting question is which kernel-side value can carry that
"who is on the line" fact at the session layer in a way nothing else on
the box can lie about. The candidates are not equally good. Source IP
is reused, NATed, and owned by whoever controls the network rather
than whoever controls the workload. Firewall marks (skb->mark)
are a 32-bit untyped global field any kernel subsystem can write, and
they leak across IPsec/XFRM transform boundaries in ways the comment thread on LWN's 2022 "Identity management for WireGuard" article documents in detail. bpf_sk_storage
is much better, but it requires a BPF program in the path and a
verifier story for whoever inspects it later, which is overbuilt for
"this user logged in over SSH". Two more carriers a Linux person will
ask about: loginuid from the audit subsystem (set by pam_loginuid, immutable for the process tree, but not matchable from netfilter) and nftables' own meta skuid
(matches socket-owner credentials, but those are mutable per process
via setuid or privilege drops). All five are good fits for the jobs they
were designed for; none of them pin identity to the lifetime of a PAM
session in a way that survives privilege changes and is also matchable
from nftables.
The cgroupv2 inode does. It is assigned by the kernel, stable for the
lifetime of the cgroup, namespaced, not writable from userspace, and
read straight from the socket's originating cgroup at
packet-classification time. No rule in any other nftables table can
spoof the value the match reads. A privileged process can move itself
between cgroups, but that is the same trust boundary that already
governs writes to cgroup.procs; pam_authnft does not introduce a new one.
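To see the value in question on a running box, here is a minimal sketch of reading a process's cgroupv2 path and the inode of the matching directory. It is Python for brevity, not the module's code. The /sys/fs/cgroup mount point is the conventional location and is an assumption here, as are both function names.

```python
import os


def cgroup_v2_path(proc_cgroup_text):
    """Extract the cgroup v2 path from the contents of /proc/<pid>/cgroup.

    The unified hierarchy is the entry with hierarchy ID 0 and an empty
    controller list, so its line starts with "0::".
    """
    for line in proc_cgroup_text.splitlines():
        if line.startswith("0::"):
            return line[3:]
    return None  # no v2 entry: a pure cgroup v1 host


def cgroup_inode(cgroup_path, mount="/sys/fs/cgroup"):
    """Return the inode number of a cgroup directory via stat(2)."""
    return os.stat(os.path.join(mount, cgroup_path.lstrip("/"))).st_ino
```

Feeding the contents of /proc/self/cgroup to cgroup_v2_path and the result to cgroup_inode gives the number for the calling process's own cgroup. Inside a container the path is relative to the cgroup namespace root, so the directory may not resolve under the host's mount.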
I have a personal piece of evidence about how easy it is to get the trust question wrong. In a 2007 thread on the OpenBSD Journal (a submission of mine, reposted on my own site years later) I proposed turning spamd's own logs into a private blocklist. Matthew Dempsky pointed out in the comments that this was trivially forgeable: anyone could send a fake postmaster@your.ip
from a free webmail account and get arbitrary IPs added to the list.
The criticism was correct, the author (me) was wrong, and the lesson
stuck: an identity carrier is only as trustworthy as the path by which
it gets its value.
So the experiment, restated against the four verbs: create the identity by asking systemd to put the session in its own scope, store it as the cgroup inode in an nftables set, transport it implicitly via the socket-to-cgroup walk the kernel already does for every packet, and verify it via meta cgroup
at classification time. Four kernel mechanisms, no new ones, and at
every step the value being trusted is one the kernel maintains itself.
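The store and verify steps reduce to membership in a set keyed on the concatenation of cgroup inode and source address. The toy model below shows only that shape. It is a Python stand-in for the nftables set, not how the kernel evaluates the match, and the class and method names are hypothetical.

```python
class SessionSet:
    """Toy stand-in for a set of (cgroup inode, source IP) elements."""

    def __init__(self):
        self._elements = set()

    def open_session(self, inode, src_ip):
        # Session open: insert the element for this session.
        self._elements.add((inode, src_ip))

    def close_session(self, inode, src_ip):
        # Session close: delete the element, so the rules stop matching.
        self._elements.discard((inode, src_ip))

    def matches(self, socket_cgroup_inode, src_ip):
        # Verify step: both halves of the key must agree for a packet to match.
        return (socket_cgroup_inode, src_ip) in self._elements


sessions = SessionSet()
sessions.open_session(4711, "192.0.2.10")
print(sessions.matches(4711, "192.0.2.10"))   # same cgroup, same peer
print(sessions.matches(4712, "192.0.2.10"))   # different cgroup, same peer
sessions.close_session(4711, "192.0.2.10")
print(sessions.matches(4711, "192.0.2.10"))   # after logout
```

Only the first lookup succeeds. A process in a different cgroup does not match even from the same source address, and nothing matches once the element is gone.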
Thoughts this experiment keeps reminding me of
The first is whether identity belongs at the layer where the kernel
currently tries to handle it, or further down. Every kernel-side carrier
(skb->mark, bpf_sk_storage, conntrack
helpers, IPsec SPIs, the cgroup inode pam_authnft uses) hits the same
ceiling: the kernel is itself the trust root, so any identity it carries
is only as trustworthy as the kernel running it. There is a long line
of work that puts identity below the kernel instead, in TPMs, in
confidential-computing attestation, in IEEE 802.1AR DevID silicon, in
the IETF's Remote ATtestation procedureS group. They answer the same
question from a place where the kernel is being measured rather than
doing the measuring. (Yes, the list of kernel-side carriers above includes
bpf_sk_storage, even though I praised it earlier as much better than the
mark; the two readings sit on different axes, intra-kernel robustness
versus the kernel as trust root.)
The second is whether we should be putting identity on the wire at all. Every few years someone proposes a new packet field, a new tag, a new label, and the proposal hits the same wall: routers exist, they cost money, they cannot all be replaced, and even if they could the field would ossify on first deployment. Meanwhile, the part of the internet that actually has to talk to spacecraft gave up on extending IP twenty years ago and uses Delay-Tolerant Networking with its own addressing scheme entirely (RFC 4838, then BPv7 in RFC 9171). The lesson is not "build DTN for everything"; it is "if your identity scheme requires every router on the planet to learn a new packet field, you have already lost". The cgroup inode is the opposite move: keep the wire format unchanged, push the identity to a place the local kernel can read without anyone else needing to know.
The post also does not name a threat model. Identity in the kernel means very different things against an unprivileged local user, a root-equivalent process, or a compromised kernel, and the problem is hard enough at the design level without locking the post into one of those upfront.
What's actually in the repo
752 lines of C across six files (501 in the module, 178 in tests), a Makefile, an example slice unit, a generator script for common fragment patterns, and a test suite covering username validation, the seccomp allowlist boundary, libnftables dry-run parsing, end-to-end fragment loading, group-membership gating, and Valgrind clean-up. The README's limitations section is the honest list, including the case where a forking PAM-invoking daemon resolves a different cgroup than the eventual session.
The narrow question this post has answered, for one small experiment on one Linux box, is where identity can live in the kernel without anything else on the box being able to corrupt it.
The wider one I'd rather you took with you: on the box you're reading this on, what decides who owns a connection at the moment it's made, and what would have to break for that answer to start lying?
The repo is identd-ng/pam_authnft.
Related reading
- systemd.resource-control(5). The closest extant alternative on stock Linux, using cgroup-BPF rather than nftables but operating on the same trust model.
- "Identity management for WireGuard", LWN 2022. The comment thread documents the skb->mark leak across IPsec/XFRM transform boundaries.
- authpf(8). The OpenBSD tool pam_authnft is the Linux-side answer to.
- Daniel Hartmeier on pf and spamd. The canonical pf+spamd writeup, hosted by the original author of pf.
- IETF Remote ATtestation procedureS working group. Standards work for asking whether the kernel you're talking to is the one you think you're talking to.
- RFC 4838 and RFC 9171. The DTN architecture and Bundle Protocol v7, what the part of the internet that talks to spacecraft uses instead of extending IP.
- RFC 8300 (Network Service Header). The on-the-wire identity-carrier school of thought the second open question is sceptical of.
- SPIFFE and SPIRE. Workload identity at the TLS and application layer; the above-the-kernel cousin to the kernel-side carrier pam_authnft uses.