I have an older computer and a newer computer both running Linux. They both sit on my desk, and connect to the same router. However, I can't connect to Grex from the new computer. I can ping grex just fine, but ssh, http, https, and telnet seem to fail. Attempts to connect just hang. I can login with ftp, but as soon as I attempt to do anything that sends a bigger hunk of data, it hangs and eventually the connection times out. I haven't noticed any computers other than Grex that the new computer can't connect to. The old computer connects to Grex just fine. The newer computer is a 64-bit system, for what it's worth. I'm guessing that it has something to do with fragmentation, but even if that is true, I don't know what to do about it. Any suggestions?
26 responses total.
I suspect there might be some subtle bug in Grex's TCP stack that could be causing these problems, or perhaps it's a bug in our firewall configuration. I'd like to upgrade Grex to the latest version of OpenBSD to see if that fixes these sorts of problems (which we're hearing about more and more frequently, though not at an alarming rate), as well as some of the mail problems we've been experiencing.
Offhand, I wonder whether this behavior has anything to do with the problems that some mail servers (e.g. U of M's ITD) have establishing connections to Grex's mail server.
FWIW, I just booted my new system into Windows (which I pretty much use exclusively for testing if things work in Windows these days), and it works in Windows. This suggests to me that there ought to be some kind of setting I can change on the Linux side that would make it work too. But I don't even know how to begin to look.
I'd start by googling the exact error message you get, plus "OpenBSD" plus (whatever version of Linux you are running at home.) If it's a well-known problem my guess is it'll have made it into a support forum somewhere.
Regarding #2; I've begun to strongly suspect that that is, in fact, the case.
Could you post the kernel and distro details? "fudge" had a similar problem some time back (thread #622 oldunix); however, no one mentioned a fix, so I don't suppose that thread would be of any value.
I don't get an error message. The connection for http and ssh just hangs. If I try to telnet to Grex, it claims to have connected, but I never get a login prompt. If I hit keys they just echo back to me, returns echoing as ^M and so forth. I can't find any messages in log files that seem to relate. Dunno if any messages appear in Grex's log files. I'm running openSUSE 10.2. The kernel version is 2.6.18.2-34-default. The processor is a 64-bit AMD processor, and this is a 64-bit version of SUSE.
I tried watching the logs on Grex while I tried to connect via ssh. Nothing
seems to get logged when I connect, but when I disconnect, /var/log/authlog
says
Jun 24 12:13:40 grex sshd[21001]: Connection closed by 66.167.211.109
Since this doesn't seem to be an exceptionally common message in authlog,
I assume it means that the connection was broken while we were still trying
to exchange startup data for the ssh connection.
Here's an attempt at an ftp connection:
    % ftp grex.org
    Connected to grex.org.
    220 grex.cyberspace.org FTP server (Version 6.6/OpenBSD) ready.
    Name (grex.org:jan): janc
    331 Password required for janc.
    Password:
    230 User janc logged in.
    Remote system type is UNIX.
    Using binary mode to transfer files.
    ftp> ls
    421 Service not available, remote server timed out. Connection closed
    ftp>
Looks like some data was exchanged, but not much.
During an ftp login like the one above, Grex's xferlog says:
    Jun 24 12:26:14 grex ftpd[10154]: connection from h-66-167-211-109.sfldmidn.dynamic.covad.net
    Jun 24 12:26:42 grex ftpd[25363]: FTP LOGIN FROM h-66-167-211-109.sfldmidn.dynamic.covad.net as janc
The password I enter is successfully sent over, and I get confirmation that it is correct (or not), but can't actually seem to do much of anything.
If I wait on an ssh connection for long enough it eventually times out. Grex's authlog file says
    Jun 24 12:23:32 grex sshd[11442]: fatal: Timeout before authentication for 66.167.211.109
    Jun 24 12:23:32 grex sshd[4689]: fatal: Timeout before authentication for 66.167.211.109
On my end, I get
    Read from socket failed: Connection reset by peer
I think the Grex end timed out well before my end timed out, but both took a while.
Though it might be difficult to distinguish what is happening, perhaps a trace from a sniffer like ethereal/wireshark might help determine what's going on. The divergence between the unsuccessful connection attempt and the successful one will almost certainly occur very early in the conversation.
I grepped through all the logs for my IP address and didn't find any other hints about what might be going on. There are some successful http requests logged, but those are from one of the other computers in my house. Here's the output from "ssh -v -l janc grex.org":
    % ssh -v -l janc grex.org
    OpenSSH_4.4p1, OpenSSL 0.9.8d 28 Sep 2006
    debug1: Reading configuration data /etc/ssh/ssh_config
    debug1: Applying options for *
    debug1: Connecting to grex.org [216.86.77.194] port 22.
    debug1: Connection established.
    debug1: identity file /home/jan/.ssh/identity type -1
    debug1: identity file /home/jan/.ssh/id_rsa type -1
    debug1: identity file /home/jan/.ssh/id_dsa type -1
    debug1: Remote protocol version 1.99, remote software version OpenSSH_4.2
    debug1: match: OpenSSH_4.2 pat OpenSSH*
    debug1: Enabling compatibility mode for protocol 2.0
    debug1: Local version string SSH-2.0-OpenSSH_4.4
    debug1: SSH2_MSG_KEXINIT sent
And then it hangs. If I telnet to port 22 (the ssh port) I get a "SSH-1.99-OpenSSH_4.2" message. I don't know enough about the SSH protocol to know how to respond, but at least a bit seems to be working.
I think I have sniffer software on my Linux box, which I used once before with limited success. I'll have to give that a try.
Even seeing what packets get sent via, e.g., tcpdump might be very useful.
I've put a packet capture from a machine which connects successfully (via ssh, running on Ubuntu Linux 6.10) in my home directory as ~mcnally/ssh_capture.pcap, in case Jan wants something to use for comparison purposes. Ethereal or wireshark will happily open it (or etherpeek or just about any decent modern sniffer.) It's a capture of me doing "ssh janc@cyberspace.org" from my machine (why janc? I figured it'd be easier to compare similar connection attempts.) I killed the ssh process once I got a "Password: " prompt, under the theory that whatever's happening to people who can't connect seems to be happening prior to that point.
Since it's simple, I started with the tcpdump. Here's a 'tcpdump -v' from when I did 'ssh -v -l janc grex.org' up until the time when it was firmly hung. "flounder.home" is my computer.
    17:53:50.815541 IP (tos 0x0, ttl 64, id 63187, offset 0, flags [DF], proto: TCP (6), length: 60)
        flounder.home.14926 > grex.cyberspace.org.ssh: S, cksum 0x42df (correct), 3372245805:3372245805(0) win 5840 <mss 1460,sackOK,timestamp 150705662 0,nop,wscale 7>
    17:53:50.883599 IP (tos 0x0, ttl 56, id 33799, offset 0, flags [none], proto: TCP (6), length: 64)
        grex.cyberspace.org.ssh > flounder.home.14926: S, cksum 0xc768 (correct), 3071614353:3071614353(0) ack 3372245806 win 16384 <mss 1452,nop,nop,sackOK,nop,wscale 0,nop,nop,timestamp 3040392798 150705662>
    17:53:50.883669 IP (tos 0x0, ttl 64, id 63188, offset 0, flags [DF], proto: TCP (6), length: 52)
        flounder.home.14926 > grex.cyberspace.org.ssh: ., cksum 0x47ed (correct), ack 1 win 46 <nop,nop,timestamp 150705679 3040392798>
    17:53:50.958296 IP (tos 0x0, ttl 56, id 35720, offset 0, flags [none], proto: TCP (6), length: 73)
        grex.cyberspace.org.ssh > flounder.home.14926: P, cksum 0x07de (correct), 1:22(21) ack 1 win 17280 <nop,nop,timestamp 3040392799 150705679>
    17:53:50.958504 IP (tos 0x0, ttl 64, id 63189, offset 0, flags [DF], proto: TCP (6), length: 52)
        flounder.home.14926 > grex.cyberspace.org.ssh: ., cksum 0x47c5 (correct), ack 22 win 46 <nop,nop,timestamp 150705697 3040392799>
    17:53:50.958714 IP (tos 0x0, ttl 64, id 63190, offset 0, flags [DF], proto: TCP (6), length: 72)
        flounder.home.14926 > grex.cyberspace.org.ssh: P, cksum 0x9103 (correct), 1:21(20) ack 22 win 46 <nop,nop,timestamp 150705697 3040392799>
    17:53:51.208457 IP (tos 0x0, ttl 56, id 38997, offset 0, flags [none], proto: TCP (6), length: 52)
        grex.cyberspace.org.ssh > flounder.home.14926: ., cksum 0x019f (correct), ack 21 win 17280 <nop,nop,timestamp 3040392799 150705697>
    17:53:51.208590 IP (tos 0x0, ttl 64, id 63191, offset 0, flags [DF], proto: TCP (6), length: 804)
        flounder.home.14926 > grex.cyberspace.org.ssh: P 21:773(752) ack 22 win 46 <nop,nop,timestamp 150705760 3040392799>
    17:53:51.477530 IP (tos 0x0, ttl 56, id 39564, offset 0, flags [none], proto: TCP (6), length: 52)
        grex.cyberspace.org.ssh > flounder.home.14926: ., cksum 0xfe6e (correct), ack 773 win 17280 <nop,nop,timestamp 3040392800 150705760>
If I telnet to Grex, it connects, but I never get a password prompt. It just echoes back what I type. Each time I type a character, 'tcpdump -v' shows:
    18:01:10.010262 IP (tos 0x10, ttl 64, id 36246, offset 0, flags [DF], proto: TCP (6), length: 53)
        flounder.home.14468 > grex.cyberspace.org.telnet: P, cksum 0x8870 (correct), 160:161(1) ack 67 win 46 <nop,nop,timestamp 150815453 2542462986>
    18:01:10.245447 IP (tos 0x0, ttl 56, id 29727, offset 0, flags [DF], proto: TCP (6), length: 52)
        grex.cyberspace.org.telnet > flounder.home.14468: ., cksum 0x9bfd (correct), ack 161 win 17280 <nop,nop,timestamp 2542463137 150815453>
which looks like the character being sent and acknowledged.
I have this problem sometimes from one computer running fc5+. Everything else in the house works. I did a trace a while back and it looked similar to Jan's. Then I quit having the problem, so I didn't pursue it further. That was quite some time ago, several months. The fix coincided with some staff activity, so I figured it was fixed. For me it only affected SSL connections, but then I never connect to Grex without SSL, even for backtalk. I doubt I even checked to see if insecure protocols worked. At the time I decided that it was something wonky with SSL/DF/grex, as I don't have any other troubles at this end.
OK, I have a clue:
I do have wireshark on my computer. I downloaded Mike's trace of his ssh
connection, and captured one myself. I hardly needed to do the comparison
because there was a fairly obvious problem in mine.
After the DNS lookup of Grex, we see the following perfectly fine packets:
1 flounder -> grex SYN
2 grex -> flounder SYN
3 flounder -> grex ACK
4 grex -> flounder Server Protocol: SSH-1.99-OpenSSH_4.2
5 flounder -> grex ACK
6 flounder -> grex Client Protocol: SSH-2.0-OpenSSH_4.4
Then I get a weirdness. Here's the full ascii dump about the seventh packet
from wireshark:
No. Time     Source        Destination Protocol Info
7   0.402649 216.86.77.194 192.168.2.4 TCP      [TCP Previous segment lost] ssh > 28702 [ACK] Seq=726 Ack=21 Win=17280 Len=0 TSV=1661941788 TSER=150966646
Frame 7 (66 bytes on wire, 66 bytes captured)
Ethernet II, Src: Cisco-Li_a5:65:48 (00:16:b6:a5:65:48), Dst: Micro-St_de:f5:34 (00:13:d3:de:f5:34)
Internet Protocol, Src: 216.86.77.194 (216.86.77.194), Dst: 192.168.2.4 (192.168.2.4)
Transmission Control Protocol, Src Port: ssh (22), Dst Port: 28702 (28702), Seq: 726, Ack: 21, Len: 0
I'm not totally sure what this is, but the "TCP Previous segment lost" bit
on a packet sent from grex to flounder doesn't sound good to me.
In Mike's dump, at this point there was a "Server: Key Exchange Init" packet
sent from grex to flounder (number 19 in his dump).
Things keep going beyond this point though
8 flounder -> grex Client Key Exchange Init
9 grex -> flounder ACK
And then we hang. In McNally's dump, there was an ACK sent back to Grex
after the server key exchange init packet, but that never happened with
my computer because the server key init packet got mangled. So the connection
hangs with my computer waiting for the server key init, and grex hanging
waiting for the ACK on the server key init packet it sent.
Looking through the packet sizes:
Number Size (me) Size (mcnally)
1 74 74
2 78 78
3 66 66
4 87 87
5 66 66
6 86 106
7 66 770 <----
So it really looks like the first time Grex tries to send a larger packet,
we lose most of it.
I'm seeing many similar "Previous segment lost" packets when I try to make
other kinds of connections to Grex. I'm pretty sure these things are the
problem, but I haven't a clue what causes them.
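
The size of the hole can be checked from the numbers already quoted. The following is a minimal Python sketch, not anything run on Grex; it assumes the relative sequence numbers shown by tcpdump and wireshark above, and it assumes the per-packet header overhead is the same 66 bytes in both captures (frame 7 here had Len=0 and was 66 bytes on the wire). The function name is made up for illustration.

```python
# Numbers below are copied from the captures quoted in this thread.
HEADER_BYTES = 66  # assumed overhead: frame 7 had Len=0 and was 66 bytes on the wire


def missing_bytes(last_seq_end: int, next_seq: int) -> int:
    """Return how many bytes were skipped between two segments from one sender."""
    return max(0, next_seq - last_seq_end)


# Grex's banner was segment 1:22 (21 bytes), so the next byte expected is 22.
# The segment that actually arrived next (frame 7) carried Seq=726.
gap = missing_bytes(last_seq_end=22, next_seq=726)

# In the working capture, packet 7 was 770 bytes on the wire.
payload_in_good_capture = 770 - HEADER_BYTES

print(gap)                      # 704
print(payload_in_good_capture)  # 704
```

If those assumptions hold, the 704 bytes missing from the failing connection match the payload of the 770-byte packet in the working capture, which fits the idea that exactly that one packet never arrived.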
I don't know a lot about fragmentation. I'm a bit surprised that a 770-byte packet got fragmented at all. I think MTUs are usually larger than that. But something must be "unusual" about my computer, or more computers would have trouble connecting to Grex. Is the MTU discovery working right? I also don't know whether the packet was fragmented by Grex itself or whether it was sent with a "may fragment" flag and fragmented by something further down the line.
So, I did "ifconfig" on both Grex and my computer, and both have MTU set at
1500. So why is a packet of size 770 being fragmented?
I did a 'tracepath grex.org' on my computer (this does path MTU discovery)
and got:
1: flounder.home (192.168.2.4) 0.205ms pmtu 1492
1: router (192.168.2.1) asymm 106 0.544ms
2: h-72-245-37-1.sfldmidn.dynamic.covad.net (72.245.37.1) asymm 1 95.463ms
3: 192.168.17.101 (192.168.17.101) asymm 2 87.163ms
4: ge-6-12-133.car2.Detroit1.Level3.net (166.90.203.1) asymm 3 83.990ms
5: ae-11-11.car1.Detroit1.Level3.net (4.69.133.245) asymm 4 80.903ms
6: ae-8-8.ebr2.Chicago1.Level3.net (4.69.133.242) asymm 5 90.831ms
7: ae-2-54.bbr2.Chicago1.Level3.net (4.68.101.97) 83.206ms
8: so-0-1-0.mp2.Detroit1.Level3.net (64.159.0.198) asymm 10 86.625ms
9: so-10-0.hsa1.Detroit1.Level3.net (4.68.115.2) asymm 8 83.993ms
10: unknown.Level3.net (63.209.134.18) 81.766ms
11: tnmi-170-200-54-69.ip.telnetww.com (69.54.200.170) asymm 10 80.879ms
12: no reply
13: no reply
14: ypsi-sfld.provide.net (216.86.64.2) asymm 13 87.456ms
15: grex.cyberspace.org (216.86.77.194) asymm 14 150.206ms
reached
Resume: pmtu 1492 hops 15 back 14
This shows a path MTU of 1492, which is pretty much what you'd expect, and
which doesn't explain anything. Maybe a tracepath to me from Grex would be
more informative, but Grex doesn't have tracepath on it and I'm not convinced
that it is worth the trouble to build.
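
For reference, the MTU figures can be related to the mss values in the tcpdump output earlier in the thread. This is a minimal Python sketch assuming plain IPv4 and TCP headers of 20 bytes each with no IP options; the function name is hypothetical.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, base header only; the MSS value does not count TCP options


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented packet for a given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


print(mss_for_mtu(1500))  # 1460
print(mss_for_mtu(1492))  # 1452
```

Those values match the "mss 1460" in the SYN from flounder and the "mss 1452" in the SYN/ACK that came back from Grex. Either way, a 770-byte packet is well under both limits, so on these numbers alone fragmentation should not be needed for it.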
This is way out in left field, but what kind of device are you using for NAT on your end, Jan? And how much imposition would it be to hang your machine directly on your incoming connection for a moment and give ssh a try, just to eliminate your NAT as a possible cause of the problem?
Joe Gelinas observed some similar thing with Linux at umich talking to Grex; something about TCP window optimizations in the Linux 2.6 kernel or something. I wonder if this is related....
So, since the only thing I can't reach with my 2.6 kernel linux is grex, wouldn't that make obsd's handling of window resizing non-compliant?
Maybe; it could be Linux 2.6 that is non-compliant. In either case, it seems clear there is an incompatibility. However, we need to upgrade anyway; that might fix it.
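
If window scaling is involved, the numbers in Jan's tcpdump show what is at stake. This is only a sketch of the arithmetic, assuming the standard TCP window scale option in which the window field of every non-SYN segment is shifted left by the factor its sender announced in its SYN; the function name is made up, and nothing here shows that scaling is actually the cause.

```python
def effective_window(raw_window: int, shift: int) -> int:
    """Scale a raw TCP window field by the sender's announced shift count."""
    return raw_window << shift


# flounder announced "wscale 7" in its SYN and then sent "win 46".
print(effective_window(46, 7))     # 5888

# Grex announced "wscale 0" and sent "win 17280".
print(effective_window(17280, 0))  # 17280
```

With the scale applied, flounder is offering 5888 bytes, plenty of room for the missing segment. Anything on the path that ignored the scale factor would instead read the window as 46 bytes, which is smaller than that segment. That is one hypothesis that would fit the trace, not a diagnosis.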
- Backtalk version 1.3.30 - Copyright 1996-2006, Jan Wolter and Steve Weiss