Grex Helpers Conference

Item 149: Grex System Problems - Spring 2006

Entered by i on Tue Mar 21 10:23:31 2006:

73 new of 333 responses total.


#261 of 333 by gull on Wed Jun 7 06:38:38 2006:

Perhaps staff should eliminate the connection delays.  What worked well 
on the medium-sized email server I used to administer may be 
ill-advised for the volume of mail Grex handles. 


#262 of 333 by vivekm1234 on Wed Jun 7 15:19:07 2006:

Having some problem with my web interface to the BBS. Some articles don't show
up, e.g. in agora 184, 189, 177, 178 and many more. Any idea why? Am I doing
something wrong? I use Firefox 1.5.0.4 on Win2K.


#263 of 333 by vivekm1234 on Wed Jun 7 15:23:52 2006:

Err ignore #262 Solved..
I needed to check "All items"
Perhaps that could be made more obvious!


#264 of 333 by tsty on Wed Jun 7 21:06:46 2006:

 ... but that would remove the thrill of finding NewStuff (tm) every time
you login ... <g>!


#265 of 333 by davel on Wed Jun 7 21:41:11 2006:

Re 261 re 259:
I see this all the time - mail to cyberspace.org hangs in the queue for hours
to days, often generating warnings after 4 hours.  If I connect to the SMTP
port, I see that "too many concurrent SMTP connections" message, & an
immediate disconnect.  And I've also gotten complaints about mail to my
Grex address not getting through; & this is much more of a problem for
Grace, who has no other email address.

But is this really a problem with the delay?  It seems to me that the delay
must be really, really, REALLY long if this is the case; when I see the
too-many-connections message the disconnect is *immediate*.  What
is the maximum allowable number of SMTP connections (or processes)
set to?  Can this be bumped up, if it realistically is much too low?

I realize that the limit is set to keep mail from completely busying out the
system.  But that doesn't mean that the default limit (if that's what it
is) is reasonable for Grex.  If the limit is inherited somehow from the
previous hardware, it's even less likely to be reasonable.  (But that seems
a bit unlikely, given that we went from a hacked sendmail to something else
at that time.)


#266 of 333 by mcnally on Thu Jun 8 00:37:28 2006:

 Grex's mail config sucks, but I'm afraid I don't know enough about exim
 to fix it.

 The maillog shows we're dropping connections on the floor pretty much
 constantly, but I'm not sure how to fix that.  I have a hypothesis that
 it may be due to our policy of introducing a 30s delay for any host listed
 in one of several RBLs, which I think causes a lot of tied up exim processes,
 but my attempts to reduce that delay to 1s by tweaking the exim.conf file
 have probably not been successful.


#267 of 333 by keesan on Thu Jun 8 01:50:00 2006:

Several people report that they can never send me mail, including from
mindspring/earthlink, which is impatient so it always times out.  


#268 of 333 by ball on Thu Jun 8 04:20:05 2006:

Re #267: That's what prompted me to post what I did on my
  Web site.  It's pointless people trying to send me email
  on Grex unless they happen to be another local user here.


#269 of 333 by rcurl on Thu Jun 8 15:42:11 2006:

I see what people mean about a problem with e-mail. I had sent a message here
a few days ago that arrived within minutes, but last night I sent one that
has still not arrived after nearly 12 hours.


#270 of 333 by keesan on Thu Jun 8 19:40:45 2006:

ssh did not work just now but telnet does.  Please someone remove the email
not working message from last week (motd).


#271 of 333 by krj on Thu Jun 8 21:05:58 2006:

My unscientific observation is that my incoming and outgoing mail
have stopped working for approximately the last 24 hours.
 
I realize that some of this is still my fault for not having moved
to a modern mail platform.


#272 of 333 by rcurl on Thu Jun 8 22:45:30 2006:

I have received some mail dated today, but not what I sent yesterday. Has it
gone into a black hole?


#273 of 333 by keesan on Thu Jun 8 23:34:36 2006:

I got three mails from AOL today (they have fixed the problem of refusing to
accept mails from us too).


#274 of 333 by keesan on Fri Jun 9 02:06:32 2006:

sdf.lonestar.org is working now, but freeshell.org is not, and the disk quotas
all set themselves to 0 (with 400MB free) and I had to 'tweak' them - I
wonder what people do who have not paid for the use of 'tweak'.  I posted this
info for people who signed up at freeshell.org for more reliable email than
grex, which has been astonishingly reliable recently.


#275 of 333 by rcurl on Fri Jun 9 14:28:59 2006:

After a few days with no spam, the dam broke this morning and 21 spam messages
poured in. But the message I sent two days ago has not. Where is it?


#276 of 333 by keesan on Fri Jun 9 15:53:15 2006:

Your message must have drowned in the flood of spam.


#277 of 333 by davel on Fri Jun 9 16:16:26 2006:

I also am seeing much more spam.  OTOH, SMTP connections are not being
closed (or much, much less frequently) with that too-many-smtp-connections
message.  I suspect no mere coincidence.


#278 of 333 by mcnally on Fri Jun 9 16:25:06 2006:

 In an attempt to end the backlogs that were causing many valid
 messages to be dropped, I disabled two of the three spam blacklist
 checks to see whether that would improve mail delivery.  Apparently
 it hasn't had any effect on the delay problems people are complaining
 about, only facilitated the delivery of more spam.

 I'll put back the copy of the exim configuration I kept from before
 my changes, restoring the status quo ante.

 I apologize for the extra spam; it was an experiment to see whether
 some tuning of the system would help the situation; unfortunately I
 don't know enough about exim to figure out how to increase the
 number of simultaneous connections it will accept.  I *do* know that
 the mail log is full of dropped connections, constantly, and it 
 would be nice if whoever committed us to exim would look at them.


#279 of 333 by keesan on Fri Jun 9 16:33:29 2006:

Mike, could you set up some simple script that would let people use
spamassassin?  I have never ever had a false positive (I only require three
points to dump suspected mail) but still sometimes get false negatives.
Spamassassin with 3 points was getting at least 3/4 of my spam.  (I added some
more filters on top of it).  Unfortunately it puts some large files into a
./.spamassassin directory but they could be deleted automatically at login.


#280 of 333 by rcurl on Fri Jun 9 16:38:18 2006:

Re #278 - so my message to me here was killed by a Grex spam filter? From 
a UM server address? Why would any of those be in a spam blacklist?


#281 of 333 by cyklone on Fri Jun 9 16:58:55 2006:

In case you are unaware, some ISPs, etc, filter HARD on umich.edu. I know 
from experience because several times in the last year or so I've had that 
problem with another ISP I use blocking the umich mail. Their reasons were 
quite understandable. Apparently, a lot of spam or other problems involve 
umich.edu addresses. I was once told it had to do with all the freebies 
the students download using their umich accounts (presumably a reference 
to the hidden "zombieware" some freebies contain) though I don't know the 
details. In any case, if you haven't learned already, be warned now: a 
umich.edu address is apt to be filtered by any number of ISPs for any 
number of reasons. I'd suggest an alternate address for time-critical 
communications.


#282 of 333 by keesan on Sat Jun 10 12:18:18 2006:

I cannot send mail from grex or even postpone it.


#283 of 333 by keesan on Sat Jun 10 12:52:43 2006:

Sdf.lonestar.org is also inaccessible again.  When I try to telnet
it gets stuck - how do I exit from the attempt (in DOS, Ctrl-C
does not work)?


#284 of 333 by lorance on Sat Jun 10 17:35:38 2006:

I've never used telnet under DOS, but in UNIX if you enter Control-]
you should get a prompt. Enter q and press enter and you should quit.
If you don't get the prompt then I have no idea.


#285 of 333 by keesan on Sat Jun 10 18:01:49 2006:

Thanks.  First time today sdf at freeshell was 'down', an hour later
it had been up for over 3 days.  ???


#286 of 333 by davel on Sat Jun 10 18:13:28 2006:

I'm not able to get any mail at all into Grex.


#287 of 333 by keesan on Sat Jun 10 18:16:10 2006:

Ctrl-] works when I am ssh'ed to grex, but what I need is a way to
end a telnet attempt FROM grex to freeshell/lonestar.  


#288 of 333 by mcnally on Sun Jun 11 07:42:02 2006:

 re #280: 
 > so my message to me here was killed by a Grex spam filter?

 I have no way of knowing that, but probably not.  Delaying messages
 from sites that are believed to be spam sites potentially affects
 delivery of messages from all sites.  Let's say that Grex is configured
 to support N simultaneous mail connections at any given time.  Now imagine
 a whole bunch of sites that are listed in these RBLs connect and attempt 
 to deliver messages.  Because they're listed in the RBLs their connections
 are intentionally delayed to slow down spam delivery.  What happens if
 N of these sites are being kept waiting while your non-blacklisted mail
 site attempts to make the N+1th connection?  

 If I understand the system properly your connection, the N+1th, 
 is rejected because the mail server is busy and it's assumed that
 the host trying to deliver it will reconnect later when the Grex
 server isn't busy.  But from what I see in the log files I think
 we're being more or less constantly bombarded with connections
 from other hosts and there's never a time when Grex's server is not
 busy and dropping connections from other hosts that want to connect.

 I really hope I'm misunderstanding something fundamental here
 but whether I am or not something is clearly very wrong with mail
 delivery.
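
 The slot-exhaustion argument above can be put into rough numbers with
 Little's law (average connections in progress = arrival rate x time each
 connection is held). This is only a sketch: the arrival rate and the limit
 of 20 are made-up illustrative figures, not values measured on Grex, and
 the function names are hypothetical.

```python
def slots_in_use(arrivals_per_second, hold_seconds):
    """Little's law: average busy slots = arrival rate * time each one is held."""
    return arrivals_per_second * hold_seconds


def is_saturated(arrivals_per_second, hold_seconds, max_connections):
    """True when delayed connections alone would fill every available slot."""
    return slots_in_use(arrivals_per_second, hold_seconds) >= max_connections


if __name__ == "__main__":
    # Hypothetical load: 2 connections per second from RBL-listed hosts.
    for delay in (30, 1):
        busy = slots_in_use(2, delay)
        print(f"delay {delay:>2}s -> about {busy} slots tied up, "
              f"saturated at a limit of 20: {is_saturated(2, delay, 20)}")
```

 Under those assumed numbers a 30s delay keeps far more connections open
 than the limit allows, while a 1s delay leaves most slots free, which is
 consistent with the hypothesis but does not prove it.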


#289 of 333 by keesan on Sun Jun 11 12:51:53 2006:

Grex is not accepting mail for the last day or two, and yesterday (and
probably today) was not sending mail either.  What is the problem and is
anyone working on it?   I have a couple of craigslist ads listing my grex
address, because freeshell was broken at the time.


#290 of 333 by keesan on Mon Jun 12 01:36:20 2006:

fastmail.fm offers 10MB webmail without ads.  Login at least 8 characters.


#291 of 333 by rcurl on Mon Jun 12 02:50:32 2006:

When are we going to get e-mail back on Grex? Or is now the time to "jump
ship" from using Grex for e-mail?


#292 of 333 by nharmon on Mon Jun 12 11:42:13 2006:

Some of us "jumped ship" a long time ago.


#293 of 333 by slynne on Mon Jun 12 12:50:32 2006:

As much as I hate to say this, I don't think that grex currently has the 
staff needed to maintain email up to the standards we all would like. 
This is especially true since there are so many excellent free email 
services out there (like gmail). 


#294 of 333 by davel on Mon Jun 12 14:27:32 2006:

I tried to offer a gmail invitation to Grace.  They supposedly sent her
email containing the invitation.  But  since mail to Grex isn't going through,
she never got it.

AFAICS Grex is just plain not accepting mail at present.  Always that same
too-many-SMTP-connections message.


#295 of 333 by keesan on Mon Jun 12 15:46:52 2006:

Last I knew grex was not sending outgoing email either.  I don't know any
other place besides freeshell to get a shell account that will let us use
non webmail with mail, mutt, or pine, and set up spamassassin and procmail.
Is anyone working on this problem?


#296 of 333 by krj on Mon Jun 12 17:01:16 2006:

Rane in resp:291 ::  It was probably time to leave Grex's email service
about a year ago.   I'm still doing too much mail here, unsuccessfully, 
too. 
 
Dave in resp:294 on Gmail invites:  Have Grace get a hotmail account, then
send the Gmail invite to the hotmail account?
 
Sindi in resp:295 :: There are probably good reasons there are 
very few public shell accounts with e-mail any more.  Email has become
a very difficult and hostile environment.  There is little reason to 
expect volunteers to work like beavers to give you reliable 1985-style
e-mail any more.


#297 of 333 by mcnally on Mon Jun 12 17:05:24 2006:

 Here's the deal.  I'm out of town, on the first vacation I've had in 
 quite a while.  As much as it annoys me (I conduct most of my personal
 e-mail through Grex and there's no telling what I'm missing, same as
 many of the rest of you..) I'm not planning on spending my vacation
 learning how to administer exim properly and fixing Grex's e-mail.

 Unfortunately nobody else from staff seems to be responding to, 
 or possibly even reading this conference's posts.  

 When I get back I'm willing to take a shot at getting a mail configuration
 working on Grex, but if I do it I'd prefer to use postfix, a mailer I'm
 more familiar with.  Also, there may be a period when mail doesn't work
 at all while things are swapped over.  If people can live with that I'll
 give it a try when I get home.  


#298 of 333 by remmers on Mon Jun 12 17:33:13 2006:

Well, I'm at least reading this item.  Don't know that I'll have time to
work on exim either.  (Mail server configuration is in general something
I haven't had a lot of experience with.)

As I recall the history, we're using exim because a staff member at the
time we were transitioning to openbsd was intimately familiar with it
from work and volunteered to set it up.  Unfortunately, for reasons that
I think were beyond his control, he's no longer an active staff member,
so we no longer have a resident exim expert.

I think that if you're willing to work on mail configuration and nobody
else is, the mail software we use should be your call, so I would
support switching to something you'd be more comfortable with.


#299 of 333 by keesan on Mon Jun 12 19:16:03 2006:

Thanks Mike and John.  Perhaps I should start an agora item in which we can
post messages for other grexers.  In the meantime I won't use the grex email
address for craigslist postings.  


#300 of 333 by naftee on Mon Jun 12 20:27:01 2006:

what on earth would sindii post on craigslist ? 

does she write about GreX twits in the rants & raves section ?


#301 of 333 by cross on Mon Jun 12 20:56:03 2006:

This response has been erased.



#302 of 333 by gull on Tue Jun 13 03:37:57 2006:

The number of simultaneous connections is governed by the 
'smtp_accept_max' variable.  Unfortunately the Exim documentation 
doesn't mention what the default setting of this variable is.

A couple other suggestions to streamline the process:
- Currently you wait up to 30 seconds for an ident response.  This 
happens during the connect phase.  I'd shorten that.  Set 
"rfc1413_query_timeout" to something in the 10 second range, if you 
absolutely can't live without it.  I find ident isn't very useful these 
days, so I tend to disable that lookup entirely.

- You have 'deliver_queue_load_max' set to 1.0, which means you aren't 
doing any deliveries when the load average is above that value.  I 
might bump that up to 2.0, since Grex seems to run pretty high load 
averages at times.  This may be part of the reason you aren't getting 
much mail through.  This is a tricky one; you're potentially trading 
off system responsiveness for getting more mail through.



If you've got more people who are familiar with postfix, switching 
might be a good idea.  MTAs are funny that way; they're complex 
programs and people who know one generally find any others to be 
incomprehensible.  For example, I know Exim reasonably well, but I'd be 
lost with Postfix, and I can just barely get Sendmail going.
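
Before changing any of the settings discussed above, it may help to see what
the config currently says. The following is a minimal sketch, assuming the
usual "name = value" layout of the main section of an Exim config, which ends
at the first "begin" line; it ignores macros, boolean shorthand and
continuation lines. The ident timeout is looked up under the spelling
`rfc1413_query_timeout` used in the Exim documentation. The sample text is
hypothetical and is not Grex's real file.

```python
import re

WATCHED = ("smtp_accept_max", "rfc1413_query_timeout", "deliver_queue_load_max")


def main_options(config_text):
    """Return {name: value} for 'name = value' lines in the main section."""
    options = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or whole-line comment
        if line.split()[0] == "begin":
            break  # the main section ends at the first "begin" line
        match = re.match(r"([A-Za-z0-9_]+)\s*=\s*(.*)", line)
        if match:
            options[match.group(1)] = match.group(2).strip()
    return options


def report(config_text):
    """Show the watched options, marking the ones that are not set explicitly."""
    opts = main_options(config_text)
    return {name: opts.get(name, "(not set; Exim default applies)")
            for name in WATCHED}


SAMPLE = """\
# hypothetical fragment for illustration only
smtp_accept_max = 20
deliver_queue_load_max = 1.0

begin acl
"""

if __name__ == "__main__":
    for name, value in report(SAMPLE).items():
        print(f"{name}: {value}")
```

Reading the real file's text and passing it to report() would show which of
the three values are set explicitly and which fall back to defaults.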


#303 of 333 by nharmon on Tue Jun 13 12:01:40 2006:

I'm familiar with exchange, but not out of choice.


#304 of 333 by keesan on Tue Jun 13 14:18:14 2006:

I could never get mail from earthlink/mindspring, it always timed out (they
did not wait long enough).  Is that related to the previous response?


#305 of 333 by mcnally on Tue Jun 13 18:36:33 2006:

 re #302:  I tried bumping up the load average limits for queueing and
 delivering mail to 2.5 or 3 or something like that before I also tried
 cutting out several of the RBL checks.  It didn't appear to make any
 significant improvement in delivery or in the number of messages dropped.

 If you're knowledgeable about exim configs you could have a look at the
 configs, suggest specific changes (maybe edit a copy, I'll look at the
 diffs, and apply them) if you have the time..


#306 of 333 by davel on Tue Jun 13 20:01:52 2006:

Re 297 (way back):
Mike, I certainly didn't mean to be dumping on you.  My experience over the
years has been that you're willing to help people whenever you can,
& I'm confident that you approach staff stuff the same way.  I know I
wasn't the only one complaining; but I for one wasn't pointing any fingers.

Meanwhile, it sure looks like someone did *something*.  The dozen queued-up
messages that had been trying to go to Grex for several days now seem to have
suddenly vanished from the queue - & as I haven't gotten any bounce messages,
I think they must have gone through.  But a new message I just sent seems
to be hanging there.  Dunno.


#307 of 333 by keesan on Tue Jun 13 21:12:32 2006:

I just tried to send myself a mail with pine.  I did not get the usual error
messages but it took a long time to go from 0% to 100% sent and then got stuck
at 100%.  Ctrl-C exited pine and told me I had sent the mail (after I waited
a few minutes for the prompt) but the mail has not arrived.


#308 of 333 by cross on Tue Jun 13 21:23:20 2006:

This response has been erased.



#309 of 333 by keesan on Tue Jun 13 22:04:39 2006:

finger root -- Charlie Root, mail last read June 12, logged in from msu.edu,
running sh which is using 14% of CPU, causing load average to be around 8.
Amazing that I can still see what I am typing.  sh is a shell.


#310 of 333 by keesan on Tue Jun 13 22:06:31 2006:

sh is now up to 16% of CPU time.  It was 13% when I first looked.
Another hole to be plugged?

166 processes, of which 1 is running and the rest are stopped.


#311 of 333 by keesan on Tue Jun 13 22:07:36 2006:

Make that the rest are idle or zombie, 1 on processor, and sh is 15%.


#312 of 333 by keesan on Tue Jun 13 22:10:55 2006:

Charlie Root has been running ssh since approximately 5:47.  jp2 has been
logged on since 5:38 and is running 'j p 2'.  The rest of us are running bbs,
party, lynx, bash, and the like.  


#313 of 333 by cross on Wed Jun 14 00:12:18 2006:

This response has been erased.



#314 of 333 by gull on Wed Jun 14 00:25:59 2006:

True, it's much better to log in as someone else and then su to root.  
OpenBSD doesn't nag about it just to be contrary. ;)


#315 of 333 by keesan on Wed Jun 14 03:00:31 2006:

I just got my first mail sent to grex in about a week, and was able to answer
it (I hope) in under 30 sec, but I still never got the mail I sent myself
earlier.  Thanks if someone fixed something.


#316 of 333 by keesan on Wed Jun 14 03:16:34 2006:

Load average is back down from 9 to .4.  How would we mortals figure out what
is causing the load average to go up so we can report it to staff?


#317 of 333 by mcnally on Wed Jun 14 08:16:18 2006:

 One of the things which is causing mail failures, I believe,
 (but not the only one) is that periodically /var/spool runs out
 of free inodes.  It runs out of free inodes because there's a
 totally absurd number of files in several subdirectories of
 /var/spool/exim which never, ever seem to get cleared out.

 Can someone who's familiar with exim tell me what purpose the
 various subdirectories of /var/spool/exim serve and how files
 in there are supposed to be purged?  Because we don't seem to
 be doing it properly..
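
 One way to watch for the inode problem described above is to check free
 inodes on the spool filesystem and count files per subdirectory. This is
 a rough sketch using only the Python standard library on a Unix system;
 the function names are hypothetical, and it needs read permission on the
 directories it walks.

```python
import os


def inode_usage(path):
    """Report total and free inodes for the filesystem that holds path."""
    st = os.statvfs(path)
    total, free = st.f_files, st.f_ffree
    # Some filesystems report 0 total inodes; avoid dividing by zero.
    used_pct = 100.0 * (total - free) / total if total else 0.0
    return {"total": total, "free": free, "used_pct": used_pct}


def files_per_subdir(spool_dir):
    """Count regular files under each immediate subdirectory of spool_dir."""
    counts = {}
    with os.scandir(spool_dir) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                counts[entry.name] = sum(
                    len(files) for _root, _dirs, files in os.walk(entry.path)
                )
    return counts


if __name__ == "__main__":
    spool = "/var/spool/exim"  # path taken from the discussion above
    print(inode_usage(spool))
    for name, count in sorted(files_per_subdir(spool).items()):
        print(f"{name}: {count} files")
```

 A subdirectory whose count keeps growing while the queue is short would
 be the first place to look.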


#318 of 333 by keesan on Wed Jun 14 21:52:57 2006:

I am still getting mail from June 8 - better late than never.  Thanks again
STeve et al.


#319 of 333 by cross on Wed Jun 14 23:18:41 2006:

This response has been erased.



#320 of 333 by keesan on Thu Jun 15 00:48:27 2006:

I was able to reply to a few June 7 mails that just showed up but now I am
getting an SMTP greeting failure when I try to send mail.  We picked the wrong
week to try to sell a car.  Turns out lots of people were interested a week
ago.


#321 of 333 by keesan on Thu Jun 15 01:22:13 2006:

Today I have been getting mail progressively from June 9, June 8, June 7 and
now June 6.  I wondered why people had stopped writing.  Thanks again.


#322 of 333 by keesan on Thu Jun 15 01:56:16 2006:

In among the June 7 I just got a May 23 mail!  Rane was saying it took up to
12 days to get mail at grex, but this is about 21 days.  How does the mail
manage to get so bogged down?  Can exim be instructed to deliver the oldest
mails first instead of the newest ones?


#323 of 333 by steve on Thu Jun 15 04:09:50 2006:

   There are some old pieces of mail in the queue which I think will now
get liberated.  Exim is likely going to get a little more tweaking
before this is all over.  But mails are moving fairly quickly now.
It took about 1.5 minutes for mail to get from msu.edu to grex, and
faster than that the other way around.


#324 of 333 by rcurl on Thu Jun 15 07:57:19 2006:

Wow! The e-mail dam has burst. But, thanks!


#325 of 333 by gull on Fri Jun 16 23:57:27 2006:

Re resp:317: I don't have the permissions to see what's 
in /var/spool/exim here, but on my own machine I see four directories.

db/ - This only has four files in it, on my system.  I think it 
contains Exim's retry database.

input/ - This holds the message queue.  If I remember right there are 
two files per message, one containing just the headers and one 
containing the message body.  If you've got a lot of crap in there, you 
need to figure out why messages are stacking up in the queue.  A common 
culprit is undeliverable bounce messages, which Exim "freezes" and 
keeps in the queue until the time set in the 
"ignore_bounce_errors_after" option is reached.  You want this time 
*short* because undeliverable bounces are almost always the worthless 
backscatter from spam runs.  Right now Grex has this set to 2 days, 
which I'd say is on the long side.

msglog/ - This holds log information about messages that are in 
transit.  The files here (one per message) are normally deleted once 
the message is delivered, although sometimes they get missed and 
linger.  Again, if this is full, it might be because you have a long 
queue.  This information is duplicated in the main log, so it's safe to 
say "message_logs=false" in the beginning part of exim.conf and delete 
the contents of this directory.  This should help the inode problem.

scan/ - This is a temporary directory where messages are unpacked while 
they're scanned by external software like ClamAV or SpamAssassin.  I 
don't think Grex is running either of those programs, so there 
shouldn't be anything in there.


#326 of 333 by mcnally on Sat Jun 17 00:18:24 2006:

 The msglog directory has tens, or possibly hundreds, of thousands of
 files, some of them dating back to 2004.

 I presume those aren't from messages Grex is still attempting to deliver..


#327 of 333 by gull on Sat Jun 17 00:39:09 2006:

Remember that Grex crashed a lot, for a while.  Most likely Exim never 
got to delete those files due to a crash, or perhaps there's a bug that 
causes them to occasionally escape deletion.

Like I said, they're safe to delete, and if you set 
"message_logs=false" you shouldn't have to deal with them anymore.  
They're really only useful for troubleshooting delivery problems, and 
the same info can be gleaned from the other logs with a bit of effort.
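
A cautious way to approach the cleanup described above is a dry run that
only lists old files, so the list can be reviewed before anything is
removed. This is a sketch, not a tested procedure for Grex: the 30-day
threshold and the function name are assumptions, and the msglog path is the
one mentioned in this thread.

```python
import os
import time


def stale_files(directory, max_age_days, now=None):
    """Return sorted paths of files not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    stale.append(path)
            except FileNotFoundError:
                continue  # file disappeared while we were scanning
    return sorted(stale)


if __name__ == "__main__":
    # Dry run only: print candidates, delete nothing.
    for path in stale_files("/var/spool/exim/msglog", max_age_days=30):
        print(path)
```

Nothing here deletes files; removal would be a separate, deliberate step
once the list looks right.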


#328 of 333 by cross on Mon Jun 19 00:18:08 2006:

This response has been erased.



#329 of 333 by gull on Tue Jun 20 20:35:07 2006:

I'm not arguing with that.  I've got four old message log files on my 
own small server, and I have no idea why.  Frankly, I think the whole 
individual-message-log thing is a misfeature that should be removed, or 
turned off by default.

UNIX software in general doesn't seem to clean up after itself very 
well.  If it did there wouldn't be a need for periodic /tmp cleaning 
scripts to get rid of old, stale lockfiles and the like.


#330 of 333 by eprom on Wed Jun 21 23:43:39 2006:

its officially summer


#331 of 333 by keesan on Thu Jun 22 01:29:36 2006:

I could not dial either number just now and connect - they both just rang.


#332 of 333 by naftee on Thu Jun 22 02:10:36 2006:

what !

m-net seems to be down :(


#333 of 333 by scholar on Thu Jun 22 02:13:23 2006:

 :(

plz fix

