Tidbits - a Linux NFS client bug
Just had a weird new NFS client bug (with Linux 4.0.1; I've been suffering from various NFS bugs since about 2.6.18, so debugging NFS issues is kind of routine).
I don't know what triggered it, but my girlfriend complained that her vim had frozen and couldn't explain why (and neither could I). As it turned out, it was stuck in fsync, and since she was editing on an NFS volume, I quickly guessed it was an NFS problem, as vim was probably trying to write its swapfile.
tcpdump quickly showed the problem - the NFS client tried to connect to the server every three seconds, but was rejected:
23:28:06.188527 IP 10.0.0.230.684 > 10.0.0.5.2049: Flags [S], seq 8126
23:28:06.188576 IP 10.0.0.230.684 > 10.0.0.5.2049: Flags [S], seq 9041
23:28:06.189458 IP 10.0.0.5.2049 > 10.0.0.230.684: Flags [S.], seq 3788, ack 8127
23:28:06.189477 IP 10.0.0.230.684 > 10.0.0.5.2049: Flags [R], seq 8127, win 0
23:28:06.189926 IP 10.0.0.5.2049 > 10.0.0.230.684: Flags [R.], seq 3508, ack 916, win 0
After checking the firewall, conntrack etc., I looked more closely at the actual tcpdump output and realised what the problem was: the client 10.0.0.230 sent out two SYNs in very quick succession (less than 50µs apart), but the SYNs were actually for two different connections: same source port, different sequence numbers!
The server 10.0.0.5 replied to the first and probably dropped the second. The NFS client then rejected the SYN ACK from the server, presumably because it had already forgotten the first connection request.
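Spotting this pattern by eye is easy to miss, so here is a minimal sketch of how one might flag it automatically. It assumes tcpdump text lines in exactly the format shown above; the names `conflicting_syns` and `window_us` are made up for illustration, and timestamps that wrap around midnight are not handled.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 23:28:06.188527 IP 10.0.0.230.684 > 10.0.0.5.2049: Flags [S], seq 8126
LINE_RE = re.compile(
    r"(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6}) IP (?P<src>[\d.]+) > (?P<dst>[\d.]+): "
    r"Flags \[(?P<flags>[^\]]+)\], seq (?P<seq>\d+)"
)


def parse(lines):
    """Return (timestamp, src, dst, flags, seq) for every line that matches."""
    packets = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a line in the expected format
        ts = datetime.strptime(m.group("ts"), "%H:%M:%S.%f")
        packets.append(
            (ts, m.group("src"), m.group("dst"), m.group("flags"), int(m.group("seq")))
        )
    return packets


def conflicting_syns(lines, window_us=1000):
    """Find consecutive pure SYNs with the same endpoints but different seq.

    Returns a list of (src, dst, first_seq, second_seq, gap_in_microseconds).
    """
    syns = [p for p in parse(lines) if p[3] == "S"]  # "S." (SYN ACK) is excluded
    hits = []
    for a, b in zip(syns, syns[1:]):
        gap_us = (b[0] - a[0]) // timedelta(microseconds=1)
        if a[1:3] == b[1:3] and a[4] != b[4] and 0 <= gap_us <= window_us:
            hits.append((a[1], a[2], a[4], b[4], gap_us))
    return hits
```

Fed the five lines from the capture above, this reports one hit: the two SYNs from 10.0.0.230.684 with seq 8126 and 9041, 49µs apart.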
At this point I was already considering a reboot: in my experience, Linux NFS problems can't be solved without one, because you usually can't umount or remount a mountpoint once it is stuck.
Workaround
Fortunately, it turned out to be fixable, at least temporarily:
mount -n -o remount,udp /path/to/mountpoint
This took a few seconds, then returned, together with all the other commands that were previously stuck (vim, sync, ls...).
I had expected it to switch the client to UDP, but it stayed with TCP, so it seems the remount alone sufficed.
I've never seen this failure mode before, so maybe it was a one-time glitch. Let's wait and see...
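One way to check which transport a mount is really using is to look at the proto= option that the Linux NFS client lists in /proc/mounts. The following is only a sketch: the function name `nfs_transports` is made up, and it assumes the usual /proc/mounts layout of device, mountpoint, filesystem type and comma-separated options.

```python
def nfs_transports(mounts_text):
    """Map each NFS mountpoint to the value of its proto= option (or None)."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mountpoint, fstype, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        proto = None
        for option in fields[3].split(","):
            # "mountproto=..." is a different option and does not match here
            if option.startswith("proto="):
                proto = option.split("=", 1)[1]
        result[fields[1]] = proto
    return result
```

On the client, something like `print(nfs_transports(open("/proc/mounts").read()))` would then show whether the remounted volume reports tcp or udp.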