I had problems with TCP connections (mostly long-lasting ssh sessions) getting dropped on my ADSL line. In the end, I found that the problem had two different root causes. The detective work behind establishing them is, I believe, interesting. It also shows how accessible source code, and the will to use it, can be a tremendous help in solving difficult system administration problems.
The first problem involved idle ssh connections getting disconnected after some time. I knew that the cause was my router clearing the NAT entries of idle connections, so I ensured that the sshd KeepAlive option was set. This did not solve the problem. However, I noticed that not all the hosts I used were dropping their idle connections, so I started by tracing packets to a host that dropped them and one that didn't:
windump -i 6 host freefall.freebsd.org or host istlab.dmst.aueb.gr
windump: listening on \Device\NPF_{688215B7-A2BE-4953-BC81-114456AEE710}
18:20:52.424148 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 3065637357 win 58400
18:20:52.424168 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 62972 (DF)
18:30:52.742158 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 1 win 58400
18:30:52.742188 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 62972 (DF)
18:40:52.992644 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 1 win 58400
18:40:52.992669 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 62972 (DF)
18:50:53.247049 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 1 win 58400
18:50:53.247066 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 62972 (DF)
74742 packets received by filter
0 packets dropped by kernel
As you can see, freefall was sending keep-alive packets, but istlab wasn't.
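The spacing of the keep-alive packets in the trace can be checked with a few lines of Python. This is only a sketch: the timestamps and host names are copied from the trace, while the `PACKETS` list and the `intervals` helper are names made up for this illustration.

```python
from datetime import datetime

# (timestamp, sender) pairs for the packets that arrived from the remote
# side, copied from the trace above. istlab sent nothing in this window.
PACKETS = [
    ("18:20:52.424148", "freefall.freebsd.org"),
    ("18:30:52.742158", "freefall.freebsd.org"),
    ("18:40:52.992644", "freefall.freebsd.org"),
    ("18:50:53.247049", "freefall.freebsd.org"),
]


def intervals(packets, sender):
    """Return the gaps, in seconds, between consecutive packets from sender."""
    times = [datetime.strptime(ts, "%H:%M:%S.%f")
             for ts, host in packets if host == sender]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


gaps = intervals(PACKETS, "freefall.freebsd.org")
print([round(g) for g in gaps])                    # [600, 600, 600]
print(intervals(PACKETS, "istlab.dmst.aueb.gr"))   # []
```

The freefall packets arrive ten minutes apart, which is the regular pattern one would expect from a keep-alive timer rather than from user activity.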
Next step: examine the sshd source to see how the KeepAlive option is implemented:
$ grep KeepAlive *.c
readconf.c:     { "keepalive", oKeepAlives },
        case oKeepAlives:
                intptr = &options->keepalives;
                goto parse_flag;
$ grep keepalives *.c
sshd.c: if (options.keepalives &&
        if (options.keepalives &&
            setsockopt(sock_in, SOL_SOCKET, SO_KEEPALIVE, &on,
            sizeof(on)) < 0)
                error("setsockopt SO_KEEPALIVE: %.100s", strerror(errno));
$ grep SO_KEEPALIVE */*.c
netinet/tcp_timer.c:        tp->t_inpcb->inp_socket->so_options & SO_KEEPALIVE) &&
        if ((always_keepalive ||
            tp->t_inpcb->inp_socket->so_options & SO_KEEPALIVE) &&
            tp->t_state <= TCPS_CLOSING) {
                if ((ticks - tp->t_rcvtime) >= tcp_keepidle + tcp_maxidle)
                        goto dropit;
int tcp_keepidle;
SYSCTL_PROC(_net_inet_tcp, TCPCTL_KEEPIDLE, keepidle, CTLTYPE_INT|CTLFLAG_RW,
    &tcp_keepidle, 0, sysctl_msec_to_ticks, "I", "");
istlab$ sysctl net.inet.tcp.keepidle
net.inet.tcp.keepidle: 7200000
freefall$ sysctl net.inet.tcp.keepidle
net.inet.tcp.keepidle: 600000
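The sysctl handler converts milliseconds to ticks, so the two hosts wait very different times before probing an idle connection: 7,200,000 ms (two hours) on istlab against 600,000 ms (ten minutes) on freefall, which matches the ten-minute spacing of the freefall packets in the first trace. Coding ain't done till all tests run, so I traced the connections again: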
windump -i 6 host freefall.freebsd.org or host istlab.dmst.aueb.gr
windump: listening on \Device\NPF_{688215B7-A2BE-4953-BC81-114456AEE710}
21:09:11.331698 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 3065651769 win 58400
21:09:11.331708 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 63080 (DF)
21:10:00.749281 IP istlab.dmst.aueb.gr.22 > eagle.spinellis.gr.4527: . ack 3784816799 win 58400
21:10:00.749298 IP eagle.spinellis.gr.4527 > istlab.dmst.aueb.gr.22: . ack 1 win 63780 (DF)
QED.
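The sshd code found by the grep above only switches keep-alives on with `SO_KEEPALIVE`; how long the connection must stay idle before the first probe is left to the kernel's `tcp_keepidle`. The same call can be sketched in Python as follows. The `enable_keepalive` helper is a hypothetical name, and the per-socket `TCP_KEEPIDLE` option is an addition that does not appear in the sources quoted above; it takes seconds, is not available on every platform, and is therefore applied only when the `socket` module exposes it.

```python
import socket


def enable_keepalive(sock, idle_seconds=None):
    """Switch on TCP keep-alives for sock, as sshd does with setsockopt()."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Assumption: where the platform offers TCP_KEEPIDLE, it overrides the
    # system-wide idle time for this one socket. Skip it silently elsewhere.
    if idle_seconds is not None and hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock, idle_seconds=600)  # ten minutes, as on freefall
# The kernel reports a non-zero value when the option is set.
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print(keepalive_on)
```

A per-socket setting like this is only useful when you control the program that opens the connection; for an existing daemon, the system-wide sysctl shown above is the knob within reach.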
The other problem involved the ADSL PPP connection dropping every six hours. The suggestion offered by my ISP's helpdesk (on three separate occasions) was to reboot the SpeedTouch 530 router, because it was getting stuck (again, and again). They claimed that small routers tend to crash and often require rebooting. The persistent nature of the problem, and the fact that the connection was dropped after approximately six hours, convinced me that the problem was more complicated.
I let the router operate without a reboot for about a day, and observed the log. The router was picking up a new global IP address after the link went down, which happened about an hour after the router renewed its internal DHCP address (getting the same one each time). See the following three groups (newest entries first):
19:00:17 PPP link up (PPPoA_1) [194.219.141.49]
18:59:58 PPP link down (PPPoA_1) [212.54.221.194]
17:59:58 DHCP lease ip-address 192.168.136.30 bound to intf eth0
17:59:58 DHCP intf eth0 renews lease ip-address 192.168.136.30
12:51:38 PPP link up (PPPoA_1) [212.54.221.194]
12:51:20 PPP link down (PPPoA_1) [194.219.73.233]
11:59:58 DHCP lease ip-address 192.168.136.30 bound to intf eth0
11:59:58 DHCP intf eth0 renews lease ip-address 192.168.136.30
07:00:19 PPP link up (PPPoA_1) [194.219.73.233]
07:00:00 PPP link down (PPPoA_1) [194.219.73.150]
06:00:00 DHCP lease ip-address 192.168.136.30 bound to intf eth0
06:00:00 DHCP intf eth0 renews lease ip-address 192.168.136.30
00:00:22 PPP link up (PPPoA_1) [194.219.73.150]
00:00:16 xDSL linestate up (downstream: 448 kbit/s, upstream: 160 kbit/s)
00:00:02 DHCP Auto DHCP: server detected on LAN, own dhcp server disabled
00:00:02 DHCP lease ip-address 192.168.136.30 bound to intf eth0
00:00:02 DHCP 192.168.136.30 (255.255.255.0) set on intf eth0: ok.
"Once is happenstance. Twice is coincidence. Three times is enemy action." (Auric Goldfinger).
So I assumed the problem was a router bug: "a new internal DHCP lease? Let's have fun restarting the external connection."
I increased the DHCP lease time for the router to 3 years, and the problem went away.
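The timing pattern in the router log can be verified mechanically. The sketch below uses the times from the log; the `EVENTS` list, its event labels, and the two helper functions are hypothetical names introduced for this illustration.

```python
from datetime import datetime, timedelta

# (time, event) pairs taken from the router log above, oldest first.
EVENTS = [
    ("06:00:00", "dhcp_renew"),
    ("07:00:00", "ppp_down"),
    ("11:59:58", "dhcp_renew"),
    ("12:51:20", "ppp_down"),
    ("17:59:58", "dhcp_renew"),
    ("18:59:58", "ppp_down"),
]


def parse(ts):
    return datetime.strptime(ts, "%H:%M:%S")


def delays_after_renewal(events):
    """For each PPP drop, return the time since the latest DHCP renewal."""
    delays = []
    last_renew = None
    for ts, kind in events:
        if kind == "dhcp_renew":
            last_renew = parse(ts)
        elif kind == "ppp_down" and last_renew is not None:
            delays.append(parse(ts) - last_renew)
    return delays


def gaps_between(events, kind):
    """Return the time between consecutive events of the given kind."""
    times = [parse(ts) for ts, k in events if k == kind]
    return [later - earlier for earlier, later in zip(times, times[1:])]


for delay in delays_after_renewal(EVENTS):
    print("drop after renewal:", delay)   # 1:00:00, 0:51:22, 1:00:00
for gap in gaps_between(EVENTS, "dhcp_renew"):
    print("between renewals:", gap)       # 5:59:58, 6:00:00
```

Two of the three drops come exactly one hour after a lease renewal and the third after 51 minutes and 22 seconds, while the renewals themselves are six hours apart: the same period as the disconnections.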
Last modified: Monday, August 23, 2004 11:12 am
Unless otherwise expressly stated, all original material on this page created by Diomidis Spinellis is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.