pam_authnft: what 'session' means to a Linux firewall

On Linux, the hard part of building a per-session firewall is not the firewall. It's deciding what "session" means in a way the kernel can match against later, without trusting anything in userspace to have told the truth.

pam_authnft is the smallest concrete experiment I could build to poke at that question. It's a PAM session module that binds nftables rules to an authenticated session using the session's cgroupv2 inode as the identity. SSH in, your firewall rules appear; log out, they're gone. No setuid helper, no dedicated shell, no kernel patches.

The framing I keep coming back to has four verbs. To give a session, a workload, or any other unit of activity a kernel-visible identity, you have to

create it somewhere
store it in something durable
transport it through whatever subsystem boundaries the packets cross
and verify it where the policy decision lives

Most of the interesting failure modes come from one of those four steps being delegated to a structure that was not designed for it. skb->mark is the store step quietly failing across IPsec/XFRM transform boundaries, where marks set on the outer encrypted packet leak into the decrypted inner one. Source-IP-based identity is the transport step failing the moment a packet crosses a NAT or masquerade boundary. Classical identd (RFC 1413) is the verify step trusting an answer supplied by the very entity being verified. pam_authnft picks, for each verb, a kernel structure that was designed to do that thing rather than one that ended up doing it.

Under the hood

OpenBSD has authpf: log in, pf gains a per-user anchor; log out, the anchor goes away. The trick on BSD is that authpf is the user's login shell, so the lifecycle of the rules is the lifecycle of the shell process. Nothing stops you from doing the same thing on Linux, but a login-shell wrapper only sees sessions that actually run a shell, and a lot of authenticated sessions (sftp, scp, rsync over SSH, an OpenVPN connection that terminates in a routing change) never do. PAM session hooks fire on every authenticated session managed by PAM. That is the layer pam_authnft sits at.

The pieces have been available on Linux for years: systemd transient .scope units (so every session lands in its own named cgroup), cgroupv2 (each cgroup directory gets a stable inode), and nftables meta cgroup (so a rule can match against the cgroup of a socket's originating process). On pam_open_session the module asks systemd over D-Bus to create a transient scope, calls stat(2) on the cgroup directory to read its inode, and inserts { inode . src_ip } into an nftables set declared as typeof meta cgroup . ip saddr. A root-owned fragment under /etc/authnft/users/<name>, validated for ownership and mode before loading, supplies the rules that reference the set. The username is rejected for path traversal or shell metacharacters; the remote host is rejected if it isn't a parseable IP, which means console logins and su sessions without a network peer are passed through unmodified. Because meta cgroup needs a local socket, this is for traffic the box originates or terminates, not traffic it forwards. On pam_close_session the element is deleted. The PAM process runs under a seccomp-BPF allowlist derived from a full strace of an open/close cycle.
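The three input gates in that paragraph (username, remote host, fragment file) can be sketched roughly as below. This is an illustration in Python, not the module's C code. The function names, the username allowlist, and the 0o022 mode mask are assumptions made for the example; the text only says that such checks exist, not how they are written.

```python
import ipaddress
import os
import re
import stat

# Assumption: a conservative allowlist. The real module's rules may differ.
_USERNAME_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,31}")


def is_safe_username(name: str) -> bool:
    """Reject names that could escape the fragment directory or reach a shell."""
    if ".." in name:
        return False
    return _USERNAME_RE.fullmatch(name) is not None


def parse_remote_ip(rhost):
    """Return an IPv4/IPv6 address object, or None if rhost is not a literal IP."""
    if not rhost:
        return None
    try:
        return ipaddress.ip_address(rhost)
    except ValueError:
        return None


def fragment_is_trustworthy(path: str, owner_uid: int = 0) -> bool:
    """Accept only a regular file owned by owner_uid and not group/world writable."""
    try:
        st = os.lstat(path)  # lstat: do not follow a symlink
    except OSError:
        return False
    if not stat.S_ISREG(st.st_mode):
        return False
    if st.st_uid != owner_uid:
        return False
    return (st.st_mode & 0o022) == 0
```

A session whose remote host fails parse_remote_ip (a console login, say) would simply be passed through without touching the firewall, which matches the behaviour described above.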

# nft list table inet authnft
table inet authnft {
    set session_map_ipv4 {
        typeof meta cgroup . ip saddr
        flags timeout
        elements = { 27711 . 127.0.0.1 timeout 1d expires 23h55m56s comment "authnft-test (PID:1127936)" }
    }

    set session_map_ipv6 {
        typeof meta cgroup . ip6 saddr
        flags timeout
    }

    chain filter {
        type filter hook input priority filter - 1; policy accept;
        meta cgroup . ip saddr @session_map_ipv4 accept
    }
}

27711 is the cgroupv2 inode of the session's transient scope (authnft-authnft-test-1127936.scope), matched by meta cgroup against the socket's originating cgroup at classification time.
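For readers who want to see where a number like that comes from, here is a small Python sketch. It makes two assumptions: the scope naming pattern is inferred from the single example name above, and both helper names are made up for this illustration. The only real mechanism involved is stat(2) on a directory.

```python
import os


def scope_unit_name(username: str, pid: int) -> str:
    # Pattern inferred from the one example in the text; treat it as an assumption.
    return f"authnft-{username}-{pid}.scope"


def cgroup_inode(cgroup_dir: str) -> int:
    """Return the inode number of a cgroup directory, as stat(2) reports it."""
    return os.stat(cgroup_dir).st_ino
```

Running ls -di on the same directory prints the same inode number, so the value in the set can be cross-checked by hand.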

Session state is inspectable with plain nft list table inet authnft: no bpftool, no BPF program introspection, no agent on a socket. The other obvious way to do per-cgroup filtering on Linux is cgroup-BPF; pam_authnft picks the older substrate so an admin debugging at 3am needs to know nftables and nothing else.

Why the cgroup inode is the interesting choice

The intuition is the one you have when you take a phone call. Once you have established who is on the line, you do not keep asking. The call carries the identity for its lifetime, and both ends behave as if it is trusted until somebody hangs up.

This is not a new intuition on the firewall side. Stateful packet inspection, commercialised by Check Point's FireWall-1 in the early 1990s, was the move from treating each packet as an independent decision to treating it as a member of a flow. Linux's conntrack and OpenBSD's pf state tables are the direct kernel-side descendants of that idea. pam_authnft takes the same shape one rung coarser: where conntrack tracks state per flow, pam_authnft tags state per session, and the cgroup inode is what carries the tag. Every connection from a process inside the session cgroup inherits it, and the two compose without arguing.

The interesting question is which kernel-side value can carry that "who is on the line" fact at the session layer in a way nothing else on the box can lie about. The candidates are not equally good. Source IP gets reused, NATed, and is owned by whoever controls the network rather than whoever controls the workload. Firewall marks (skb->mark) are a 32-bit untyped global field any kernel subsystem can write, and they leak across IPsec/XFRM transform boundaries in ways the 2022 LWN discussion documents in detail. bpf_sk_storage is much better, but it requires a BPF program in the path and a verifier story for whoever inspects it later, which is overbuilt for "this user logged in over SSH". Two more carriers a Linux person will ask about: loginuid from the audit subsystem (set by pam_loginuid, immutable for the process tree, but not matchable from netfilter) and nftables' own meta skuid (matches socket-owner credentials, but those are mutable per process via setuid or privilege drops). All five are good fits for the jobs they were designed for; none of them pin identity to the lifetime of a PAM session in a way that survives privilege changes and is also matchable from nftables.

The cgroupv2 inode does. It is assigned by the kernel, stable for the lifetime of the cgroup, namespaced, not writable from userspace, and read straight from the socket's originating cgroup at packet-classification time. No rule in any other nftables table can spoof the value the match reads. A privileged process can move itself between cgroups, but that is the same trust boundary that already governs writes to cgroup.procs; pam_authnft does not introduce a new one.
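The userspace view of the same membership is /proc/<pid>/cgroup, which is handy when checking which session a process currently belongs to. The sketch below is a debugging aid, not part of pam_authnft. The helper names are hypothetical, and /sys/fs/cgroup is assumed to be the cgroup2 mount point, which is the usual location but not guaranteed.

```python
import os


def cgroup2_path_from_proc(text: str):
    """Extract the cgroup v2 path from the contents of /proc/<pid>/cgroup.

    The unified-hierarchy entry has hierarchy ID 0 and an empty controller
    list, so the line looks like "0::/some/path".
    """
    for line in text.splitlines():
        hier_id, _, rest = line.partition(":")
        controllers, _, path = rest.partition(":")
        if hier_id == "0" and controllers == "":
            return path
    return None


def cgroup_dir_for_pid(pid: int, mount: str = "/sys/fs/cgroup") -> str:
    """Map a PID to its cgroup directory (assumes cgroup2 is mounted at `mount`)."""
    with open(f"/proc/{pid}/cgroup") as fh:
        path = cgroup2_path_from_proc(fh.read())
    if path is None:
        raise RuntimeError("no cgroup v2 entry for this process")
    return os.path.join(mount, path.lstrip("/"))
```

Calling os.stat on the directory returned by cgroup_dir_for_pid and reading st_ino gives the number to compare against the elements in the set.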

I have a personal piece of evidence about how easy it is to get the trust question wrong. In a 2007 thread on the OpenBSD Journal (a submission of mine, reposted on my own site years later) I proposed turning spamd's own logs into a private blocklist. Matthew Dempsky pointed out in the comments that this was trivially forgeable: anyone could send a fake postmaster@your.ip from a free webmail account and get arbitrary IPs added to the list. The criticism was correct, the author (me) was wrong, and the lesson stuck: an identity carrier is only as trustworthy as the path by which it gets its value.

So the experiment, restated against the four verbs: create the identity by asking systemd to put the session in its own scope, store it as the cgroup inode in an nftables set, transport it implicitly via the socket-to-cgroup walk the kernel already does for every packet, and verify it via meta cgroup at classification time. Four kernel mechanisms, no new ones, and at every step the value being trusted is one the kernel maintains itself.
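As a rough picture of the store step, the open and close halves of a session reduce to one set insertion and one set deletion. The Python sketch below only builds the command strings, in the form the nft command line accepts. The table and set names are taken from the listing earlier in the post; the function names and the default timeout are assumptions for this example, and how the module itself submits the commands is not shown here.

```python
import ipaddress

TABLE = "inet authnft"  # table name as shown in the listing earlier in the post


def set_name_for(ip) -> str:
    """Pick the per-family set from the listing."""
    return "session_map_ipv4" if ip.version == 4 else "session_map_ipv6"


def element_spec(inode: int, ip) -> str:
    # Concatenated key, in the same order as the set's typeof declaration.
    return f"{inode} . {ip}"


def open_session_cmd(inode: int, rhost: str, timeout: str = "1d") -> str:
    """Command string that adds the session's element to the matching set."""
    ip = ipaddress.ip_address(rhost)
    return (f"add element {TABLE} {set_name_for(ip)} "
            f"{{ {element_spec(inode, ip)} timeout {timeout} }}")


def close_session_cmd(inode: int, rhost: str) -> str:
    """Command string that removes the session's element again."""
    ip = ipaddress.ip_address(rhost)
    return (f"delete element {TABLE} {set_name_for(ip)} "
            f"{{ {element_spec(inode, ip)} }}")
```

The timeout on the element acts as a backstop: if the close hook never runs, the entry still expires on its own.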

Thoughts this experiment keeps reminding me of

The first is whether identity belongs at the layer the kernel currently tries to handle it, or further down. Every kernel-side carrier (skb->mark, bpf_sk_storage, conntrack helpers, IPsec SPIs, the cgroup inode pam_authnft uses) hits the same ceiling: the kernel is itself the trust root, so any identity it carries is only as trustworthy as the kernel running it. There is a long line of work that puts identity below the kernel instead, in TPMs, in confidential-computing attestation, in IEEE 802.1AR DevID silicon, in the IETF's Remote ATtestation procedureS (RATS) working group. They answer the same question from a place where the kernel is being measured rather than doing the measuring. (Yes, that list includes bpf_sk_storage, even though I praised it earlier as much better than the mark; the two readings sit on different axes, intra-kernel robustness versus the kernel as trust root.)

The second is whether we should be putting identity on the wire at all. Every few years someone proposes a new packet field, a new tag, a new label, and the proposal hits the same wall: routers exist, they cost money, they cannot all be replaced, and even if they could the field would ossify on first deployment. Meanwhile, the part of the internet that actually has to talk to spacecraft gave up on extending IP twenty years ago and uses Delay-Tolerant Networking with its own addressing scheme entirely (RFC 4838, then BPv7 in RFC 9171). The lesson is not "build DTN for everything"; it is "if your identity scheme requires every router on the planet to learn a new packet field, you have already lost". The cgroup inode is the opposite move: keep the wire format unchanged, push the identity to a place the local kernel can read without anyone else needing to know.

The post also does not name a threat model. Identity in the kernel means very different things against an unprivileged local user, a root-equivalent process, or a compromised kernel, and the problem is hard enough at the design level without locking the post into one of those upfront.

What's actually in the repo

752 lines of C across six files (501 in the module, 178 in tests), a Makefile, an example slice unit, a generator script for common fragment patterns, and a test suite covering username validation, the seccomp allowlist boundary, libnftables dry-run parsing, end-to-end fragment loading, group-membership gating, and Valgrind clean-up. The README's limitations section is the honest list, including the case where a forking PAM-invoking daemon resolves a different cgroup than the eventual session.

The narrow question this post has answered, for one small experiment on one Linux box, is where identity can live in the kernel without anything else on the box being able to corrupt it.

The wider one I'd rather you took with you: on the box you're reading this on, what decides who owns a connection at the moment it's made, and what would have to break for that answer to start lying?

The repo is identd-ng/pam_authnft.
