Aggressive yet sane persistent SSH with systemd and autossh

Not too long ago, I was contracted to assist with a K8s deployment, and the developers' approach to setting up persistent SSH tunnels left something to be desired.

Admin Mourning

Autossh is a great tool for persistent SSH connections; I use it mostly for reverse port and socket forwarding. No punching holes in firewalls or exposing services to the Internet. I love it.

Folksy ideas like the gems in these comments, however, do little to halt the spread of disinformation.

At the outset, creating persistent SSH connections using autossh seems simple enough. I mean, autossh takes care of tearing down the tunnel and re-creating it, right? Right?
The answer is, mostly yes, but TCP rules make it messy. And since not all developers are familiar with TCP, this post will explain why and how autossh should really be configured.

Before we start, let's get one thing out of the way. Systemd can manage ssh connections on its own, but if you are looking for persistence, you are likely not calling ssh just for a remote shell to work in. This means there are knobs that may need to be turned before, during and after the session is started and restarted.
Systemd is the perfect wrapper for autossh in this scenario: autossh manages the connections, while systemd manages the knobs. If you are using SSH just for remote port forwarding, not remote socket forwarding, you could get away without autossh.

Remember that SSH runs over TCP. This makes it reliable and great for a quick SOCKS proxy "VPN", but it is also susceptible to the TCP-over-TCP problem, which is one reason many VPNs use UDP.

TCP CONNECTION TERMINATION

The crux of this story.

A device in the CLOSE-WAIT state does not know how long it will take for the connection to close. This is where TCP's TIME-WAIT state comes in: it gives the final FIN --> <-- ACK exchange enough time to complete and provides a buffer for retransmission of lost segments. As of 2021, this period is a sensible-ish 60 seconds on Linux.

On Linux, some TCP features are not tunable; specifically, TCP_TIMEWAIT_LEN is hardcoded.

#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
                                  * state, about 60 seconds */

There have been proposals to turn this into a tunable value, but they have been refused on the grounds that a fixed TIME-WAIT period is a good thing for internetworks.

You can view your system's related TCP_FIN_TIMEOUT with:
$ cat /proc/sys/net/ipv4/tcp_fin_timeout
60

So a recent kernel will wait 60 seconds to tear down an unresponsive connection. This is a key piece in our attempt to manage persistent TCP/SSH connections via systemd.
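
If you want to watch this on a live box, here is a rough sketch of two helpers. The function names are mine, not anything standard, and I am assuming a Linux host with `ss` from iproute2 installed. The first reads the FIN timeout (the optional path argument exists only so it can be pointed at a test file); the second counts sockets currently sitting in TIME-WAIT.

```shell
# Print the kernel's FIN timeout in seconds, or "unknown" if the file
# cannot be read (for example on a non-Linux system).
# The optional path argument is a hypothetical convenience for testing.
fin_timeout() {
    path="${1:-/proc/sys/net/ipv4/tcp_fin_timeout}"
    if [ -r "$path" ]; then
        cat "$path"
    else
        echo "unknown"
    fi
}

# Count TCP sockets currently in TIME-WAIT.
# -H suppresses the header line, -t selects TCP, -a all states, -n numeric.
count_time_wait() {
    ss -H -tan state time-wait | wc -l
}
```

Running `count_time_wait` right after killing a tunnel a few times should show the count climb and then drain over roughly a minute.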

Set aggressive network timeouts rather than systemd unit timeouts; this will also avoid "Broken pipe" errors. Configure ssh so that after 30 seconds of no server response, the client exits and autossh negotiates a new session. This assumes SSH protocol version 2.
Set 'ClientAliveInterval 10' in the remote sshd_config so that unresponsive SSH clients are disconnected after approximately 30 seconds (10 x ClientAliveCountMax, which defaults to 3).

ExecStart=/usr/bin/autossh \
            -M 0 -o "ServerAliveInterval 10" \
            -o "ServerAliveCountMax 3" \
            -o "StreamLocalBindUnlink yes" \
            -o "ExitOnForwardFailure yes" \
            -L /var/run/sig.sock:/var/run/sig.sock \
            -N ssh.server #sleep 10

In the example above, the ssh client will disconnect from an unresponsive server after approximately 30 seconds (ServerAliveInterval 10 x ServerAliveCountMax 3).
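
The arithmetic is trivial, but it is worth writing down when you start tuning these values. A minimal sketch, with a helper name I made up (`alive_window`), assuming the count defaults to 3 as it does in OpenSSH when you leave it unset:

```shell
# Seconds until a dead peer is detected:
# keepalive interval multiplied by the number of missed probes tolerated.
alive_window() {
    interval="$1"
    count="${2:-3}"   # 3 is the OpenSSH default for *AliveCountMax
    echo $((interval * count))
}

alive_window 10 3   # prints 30
```

The same calculation applies to ClientAliveInterval and ClientAliveCountMax on the server side.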

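The trailing sleep 10 is the usual trick of running a short remote command in place of -N, so that ssh exits if no TCP connections are forwarded within 10 seconds.
This isn't required, but it is shown here commented out, as it can make scripting SSH connections easier since we then get the remote shell's return code. Note that systemd does not treat a trailing # as a comment, so leave it out of a real unit file.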

So, if we wish to keep an SSH connection up aggressively, we should aim to never hit TCP timeout limits. This is achieved by setting ServerAliveInterval on the client and ClientAliveInterval on the server. Both default to 0 in ssh and sshd, which means the keepalive probes are disabled unless you set them.
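
To check what a client will actually use, `ssh -G <host>` prints the effective configuration as lowercase `key value` lines. Below is a small sketch that reads that output on stdin and reports whether keepalives are on. The function name `keepalive_status` and its output format are my own invention.

```shell
# Read `ssh -G` style output on stdin and report the keepalive state.
# Prints "disabled" when the interval is 0 or missing, otherwise
# "enabled N" where N is interval * count in seconds.
keepalive_status() {
    awk '
        $1 == "serveraliveinterval" { interval = $2 }
        $1 == "serveralivecountmax" { count = $2 }
        END {
            if (interval + 0 == 0) print "disabled"
            else print "enabled", interval * count
        }'
}

# Typical use (not run here, since it needs a configured host):
#   ssh -G remote.server | keepalive_status
```

Options passed with -o on the autossh command line are not in your ssh_config, so add the same -o flags to the `ssh -G` call if you want to see them reflected. On the server, `sshd -T` prints the equivalent effective settings for ClientAliveInterval and ClientAliveCountMax.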

If all else fails, allow a graceful TCP timeout before restarting the systemd service:

Restart=always
RestartSec=60

TCMP

Here's the result: nothing too complicated, but sensible enough to co-exist with multiple autossh connections and not go crazy fighting the kernel by incessantly restarting the service if it drops.

$ cat ~/.config/systemd/user/signald_autossh.service
[Unit]
Description=AutoSSH service to remotely access signald's unix socket
After=network-online.target
#After=network-online.target sshd.service # Use this instead if autossh will interact with the local SSH server

[Service]
Environment="AUTOSSH_GATETIME=30"
Environment="AUTOSSH_POLL=30"
Environment="AUTOSSH_FIRST_POLL=30"
Environment="SSOCK=/var/run/signald/signald.sock"

ExecStart=/usr/bin/autossh -M 0 -o "ServerAliveInterval 10" -o "ServerAliveCountMax 3" -o "StreamLocalBindUnlink yes" -o "ExitOnForwardFailure yes" -L ${SSOCK}:${SSOCK} -N remote.server
ExecStop=/usr/bin/kill $MAINPID
ExecReload=/usr/bin/kill -HUP $MAINPID

Restart=always
RestartSec=60
KillMode=control-group

[Install]
WantedBy=default.target
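
For completeness, here is roughly how I would load and start a user unit like the one above. Treat it as a sketch: the helper names `run` and `enable_tunnel` and the DRY_RUN switch are hypothetical, and I am assuming the unit lives under ~/.config/systemd/user as shown. The `loginctl enable-linger` step lets the user manager, and therefore the tunnel, start at boot without an interactive login.

```shell
# Run a command, or just print it when DRY_RUN=1 (handy for reviewing).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Reload user units, enable and start the given unit, and enable
# lingering for the current user.
enable_tunnel() {
    unit="$1"
    run systemctl --user daemon-reload
    run systemctl --user enable --now "$unit"
    run loginctl enable-linger "$(id -un)"
}

# Typical use:
#   enable_tunnel signald_autossh.service
#   systemctl --user status signald_autossh.service
```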

If you wish to forward a remote port instead of a remote socket as in the example above, use "-L 9090:localhost:9090 -N remote.server" and remove -o "StreamLocalBindUnlink yes" from your autossh command line. Alternatively, all this could be configured in the server's sshd_config if you don't trust your ssh clients.
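
If you juggle both kinds of forward, a tiny helper can build the -L argument for you. This is only an illustration with a made-up function name (`forward_arg`), and it assumes the same path or port on both ends, as in the examples in this post.

```shell
# Build the -L argument for a Unix socket or a TCP port forward.
# An absolute path is treated as a socket; anything else as a port.
forward_arg() {
    case "$1" in
        /*) echo "-L $1:$1" ;;             # socket to socket
        *)  echo "-L $1:localhost:$1" ;;   # local port to remote localhost port
    esac
}

forward_arg 9090                # prints -L 9090:localhost:9090
forward_arg /var/run/sig.sock   # prints -L /var/run/sig.sock:/var/run/sig.sock
```

Only the socket form needs -o "StreamLocalBindUnlink yes".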

Using something like Rapid SSH Proxy would sidestep this issue completely.
