|
Grex > Helpers > #140: Grex System Problems - Spring 2005 | |
|
| Author |
Message |
| 25 new of 457 responses total. |
jor
|
|
response 150 of 457:
|
Apr 26 23:31 UTC 2005 |
www.yahoo.com
test me at jor48105@yahoo.com
Hints: free. Deals with crap like HTML and graphics.
Alternative: stamp your little feet and declare
yourself a "paying member" of grex,
which, I understand, accepts contributions.
|
glenda
|
|
response 151 of 457:
|
Apr 26 23:35 UTC 2005 |
Increasing the space in /var/mail will help for only a short time. The
new amount of space will just fill up the same as the smaller space. And,
we cannot increase the space anywhere until we get the new drive and get it
up and running. There just isn't any space to allocate to it right now.
|
bru
|
|
response 152 of 457:
|
Apr 26 23:45 UTC 2005 |
I find your lack of faith...Disturbing.
Darth Grex
|
keesan
|
|
response 153 of 457:
|
Apr 27 00:31 UTC 2005 |
I tried yahoo mail and hated it. Webmail is always slow and annoying.
Eventually a systemwide spam filter will drastically reduce the amount of
unread and unwanted mail sitting in unused accounts and free up space.
What is keeping grex from installing a new drive sooner? There was something
about a new drive being saved for a new BSD installation but what would it
cost to have two new drives?
|
jep
|
|
response 154 of 457:
|
Apr 27 00:33 UTC 2005 |
M-Net seems to have enough space for e-mail.
|
mcnally
|
|
response 155 of 457:
|
Apr 27 00:35 UTC 2005 |
I believe that reducing the amount of incoming Spam will dramatically
reduce the demand for mail storage space and finding ways to block
some of it will be an important step towards a sustainable mail service.
|
paull
|
|
response 156 of 457:
|
Apr 28 19:07 UTC 2005 |
(This is actually davel, using paull's account to post this.
You'll see why in a moment.)
This is very, very strange. Grex has apparently been down now for a couple
of days. I can't connect to it at all via the internet. Grace (gracel) has
been trying to dial in. Yesterday she got no connection. Today she
was able to connect, but it kept telling her that her login was incorrect.
I was able to duplicate this. But then she tried it with Paul (paull)'s
account, & it let her in. I have, obviously, reproduced this behavior,
too.
Now comes the really weird part. Wondering if somehow Grace & I had
gotten deleted from the password file, I tried the following:
egrep '(davel)|(gracel)' /etc/passwd
davel:*:2681:1002:Dave Lovelace:/a/d/a/davel:/bin/bash
gracel:*:2731:1002:Grace Lovelace:/a/g/r/gracel:/bin/bash
But if I scan for paull (to which I'm logged in right now, remember)
or for my other son, Jon (kingjon), it finds nothing in the password
file. But I can log in as paull, even though there's no entry in /etc/passwd.
I see that there is no /a mounted (& gracel's & my home dirs in /etc/passwd
are under /a), that logged in as paull (no /etc/passwd entry) I find myself
in what looks like Paul's home directory, but it's /grex/a/p/a/paull -
and that /grex/a/d/a/davel and /grex/a/g/r/gracel do exist. I suspect
that this may have something to do with the "incorrect login" messages.
|
gelinas
|
|
response 157 of 457:
|
Apr 30 08:19 UTC 2005 |
On Wednesday, April 27, 2005, STeve Andre wrote, in a message:
} Someone did something to use up all the CPU. You can start an ssh telnet
} or ftp session but nothing ever starts so it's brain dead at the moment.
Attempts to reboot the machine failed. When we did get the machine to
reboot, we discovered that "the pw database's in a totally inconsistent
state. /etc/passwd had about 1137 entries in it, and the database files were
of different lengths. This means the passwd system was in chaos, . . . "
(STeve, in a message dated April 29, 2005). Which is why some people could
log on, but others couldn't.
At or about 02:42 this morning (April 30, 2005), STeve wrote: "Grex is up at
the moment. The damage done with the accounts can be fixed this weekend or
not, but it won't affect all the other accounts. Better to have Grex up now
for the majority of people on Grex."
There appear to be some problems with newuser, so it has been turned off
until the problems can be investigated and repaired. The web newuser program
is also disabled.
|
russ
|
|
response 158 of 457:
|
Apr 30 11:41 UTC 2005 |
As long as newuser is down anyway, it's time to splat the trolls.
I want to note here that NOBODY updated the hvcn status page nor
the main Grex page with any information whatsoever since Wednesday.
|
happyboy
|
|
response 159 of 457:
|
Apr 30 14:56 UTC 2005 |
splat the trolls = remove the ribbon
|
twenex
|
|
response 160 of 457:
|
Apr 30 20:45 UTC 2005 |
I pinged grex last night to see if it was running; apparently i left the ping
command running all day. Apologies if it caused any problems.
|
gelinas
|
|
response 161 of 457:
|
Apr 30 22:18 UTC 2005 |
It did get updated at 23:00 or so last night. I'll leave it that way a bit
longer.
|
naftee
|
|
response 162 of 457:
|
May 1 03:51 UTC 2005 |
the grex conference on m-net was updated in a satisfactory and timely manner.
|
steve
|
|
response 163 of 457:
|
May 1 05:48 UTC 2005 |
I'm baack. ;-)
I wish things had gone more evenly over the last three weeks. It's
been alternately boring and too exciting lately. I think things are
evening out, thankfully.
Grex went down on Tuesday because of what I believe to be a fork
bomb, which pretty much absorbed all the system. Some will remember
that the start of a telnet or ssh session worked and then nothing.
I think that was the daemon responding to the socket connect but then
not being able to do anything else. During that time there were
people trying to create accounts, primarily via the web newuser, and
pretty much all hell broke loose internally. The new accounts were
created in the /etc/passwd file OK, but the logging of them is just
totally bizarre. There are multiple entries for many accounts,
with incorrect data associated with them. I'm still trying to sort
all that out.
Once I got over to Provide.net it became clear that the passwd file
was really messed up. Root's password wasn't what it should be, nor
were any others that I knew of. Booting Grex into single user mode
revealed an /etc/passwd file of about 1100 accounts, not the 24,000
that it should have had. Worse, the "master.passwd" main passwd file
and associated database (.db) files were messed up as well. The
master.passwd file had a different line count than passwd did, which
is horribly wrong.
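
A rough sketch of the sanity check described above, assuming the two files are supposed to have one line per account each. The function name `same_line_count` is made up for illustration; it is not one of the scripts actually used on Grex.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if two files have the same line count.
# Intended use (as root): same_line_count /etc/passwd /etc/master.passwd
same_line_count() {
    # Refuse to guess if either file is missing or unreadable.
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # Some wc implementations pad the count with spaces; strip them.
    a=$(wc -l < "$1" | tr -d '[:space:]')
    b=$(wc -l < "$2" | tr -d '[:space:]')
    if [ "$a" -eq "$b" ]; then
        echo "ok: both files have $a lines"
    else
        echo "MISMATCH: $1 has $a lines, $2 has $b lines"
        return 1
    fi
}
```

Run from cron, a non-zero exit status could be used to send mail to staff before anyone notices logins failing.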
Hours before I went into the hospital on April 6th I discovered
that the reason why Grex couldn't create accounts was because of a
bad disk spot that was underneath one of the passwd database files
and prevented accounts from being updated properly. It was at that
point that I made a backup of the /etc directory. That proved very
useful.
At Provide.net I finally figured out the stuff about the weirdness
with the passwd files and copied my set over. That then let me put
Grex back into multi-user mode with an /etc/nologin file so Grex
could start processing mail. Once Glenda and I were satisfied that
the system seemed to be OK other than the passwd stuff, we left
Provide.
As an aside, Grex now lives in the Attic. This is the space on
the second floor of Provide.net, and with its curved in walls it looks
like an attic. It's a nice facility by the way, nicely air-conditioned
and several UPS's to feed all the computers. The highest tech attic
I've yet encountered. ;-)
Anyway, it was leaving Provide that was a mistake. By bringing Grex
up and not doing anything special, I wound up destroying a perfectly good
copy of the passwd file. To understand this you have to understand that
Jan wrote a very nice set of shell scripts which make backups of Grex
onto our IDE disk on the /mirror partition. This occurs every day so
we have a backup of things. It's already come in handy for me, twice.
The catch is that it keeps only one level of backups. I had the
passwd stuff in place on /etc (my version from 4/6) when the backup
script ran. That then overwrote the complete copy of the passwd
stuff. By the time I realized this might happen it was too late. We
had a copy of the backup passwd in /mirror.
That was stupid of me and I think I owe about 1,100 apologies to
the people whose accounts are in limbo. Their home directories are
on the disk, but with no associated entries in the passwd file.
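
One way to list the accounts in limbo might look like the sketch below. It assumes home directories sit three levels under a root, as in the `/a/d/a/davel` paths quoted earlier in this item; the function name and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print each login that has a home directory under
# HOME_ROOT/<x>/<y>/<login> but no matching entry in PASSWD_FILE.
# Usage: list_orphan_homes PASSWD_FILE HOME_ROOT
list_orphan_homes() {
    passwd_file=$1
    home_root=$2
    for dir in "$home_root"/*/*/*/; do
        [ -d "$dir" ] || continue          # glob matched nothing
        login=$(basename "$dir")
        # Field 1 of a passwd line is the login name; compare it exactly.
        if ! awk -F: -v u="$login" \
                '$1 == u { found = 1 } END { exit !found }' "$passwd_file"
        then
            echo "$login"
        fi
    done
}
```

This runs awk once per directory, which is slow for tens of thousands of accounts but is simple enough to trust for a one-time repair.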
It was then that I discovered the data in the nulogfile (the copy
of newuser runs) was crazy.
So I'm now in the process of sorting that data out and figuring out
the best way to restore accounts. If the password information in the
nulogfile is correct I think I can restore some? all? accounts. We'll
see.
After the problem on 4/6 I remembered that I needed to create the
scripts I wrote on SunOS, where copies of /etc/passwd were made every
6 hours. I did not do this, much to my shame. Had I done this none
of this passwd weirdness would have mattered. I'm going to fix that
soon, and add something, namely teach a machine at work to sftp into
Grex once a day and grab a copy of master.passwd and group, so we'll
have an off site backup of this data.
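
A minimal sketch of the kind of periodic snapshot described above, keeping several generations so that one bad run cannot overwrite the only good copy. It is not the old SunOS script; `snapshot_files`, the destination directory and the retention count are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: copy FILEs into a new timestamped directory under
# DEST_DIR, then delete all but the newest KEEP snapshot directories.
# Usage: snapshot_files DEST_DIR KEEP FILE...
snapshot_files() {
    dest=$1
    keep=$2
    shift 2
    # UTC timestamps in this format sort correctly as plain text.
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    snap="$dest/$stamp"
    mkdir -p "$snap" && chmod 700 "$snap" || return 1
    cp -p "$@" "$snap"/ || return 1
    # Newest first; everything after the first KEEP entries is pruned.
    # Assumes DEST_DIR contains only snapshot directories.
    ls -1 "$dest" | sort -r | tail -n +"$((keep + 1))" |
    while read -r old; do
        rm -rf "${dest:?}/$old"
    done
}
```

A cron entry that runs something like `snapshot_files /mirror/passwd-snapshots 28 /etc/master.passwd /etc/group` every 6 hours would keep about a week of copies (the path and the count are examples only). The off-site copy would still be pulled separately by the remote machine.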
Making a few comments on stuff I've read earlier in this item:
We need to alter the size of some of the partitions on Grex. In
particular the /var/mail partition is too small. We keep on running
into this because of the ever increasing amounts of spam we're getting.
Speaking of spam, it is my *hope* that I'll be able to get back
to working with Exim and SpamAssassin soon, and talk with other staff
about using it here. Spam is a *complex* problem, one that has no
easy solutions, but we now have enough raw CPU power to start dealing
with it. We've not done this for far too long.
I'm not sure why we haven't ordered the new disk yet; I think
there has been some confusion over this. I hope that's settled before
long. This touches on the upgrade we need. OpenBSD 3.7 has started
to ship, so I think we can upgrade sometime in May. There are several
things that have been improved, including support for our network card,
which has caused one or two panics.
So that's it for the moment.
|
keesan
|
|
response 164 of 457:
|
May 1 15:54 UTC 2005 |
Could you set up a simple script that lets anyone who chooses to do so throw
out anything with an X-RBL warning? This would eliminate about half my spam.
I keep a log. And restore the 100K mail size limit somehow? Or let people
choose to throw out anything over that size, with the same script? People
who have tried to send me large attachments generally write me with a smaller
mail when they bounce and I explain to send elsewhere.
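
The per-user rule asked for above could be sketched roughly as follows. The header name `X-RBL-Warning:` is an assumption (the exact header Grex's mailer adds is not shown in this item), and `should_discard` is a made-up name; how it would be hooked into mail delivery is left open.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if a message should be thrown out, either
# because it is larger than MAX_BYTES (default 100K) or because its
# headers carry an RBL warning.
# Usage: should_discard MESSAGE_FILE [MAX_BYTES]
should_discard() {
    msg=$1
    max_bytes=${2:-102400}
    size=$(wc -c < "$msg" | tr -d '[:space:]')
    [ "$size" -gt "$max_bytes" ] && return 0
    # Headers end at the first blank line; never look at the body.
    # ASSUMPTION: the warning header is spelled "X-RBL-Warning:".
    sed '/^$/q' "$msg" | grep -qi '^X-RBL-Warning:'
}
```

Only users who opt in would run this, so anyone who wants large attachments or RBL-flagged mail is unaffected.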
How old are the hard disks that grex is running on now? Can they be checked
regularly for bad spots or the likelihood of crashing? We used some program
on a disk that Scott gave us, which told us it was in imminent danger of
failing (it had already slowed down a lot).
|
drew
|
|
response 165 of 457:
|
May 1 18:03 UTC 2005 |
I've heard of such a program from, I think, Symantec. Designed to deal with
the fact that modern hard drives have circuitry which attempts to hide the
fact that some of the disk goes bad throughout its service life by moving data
around to good sectors and lying about the actual state of the media. I forget
what it's called.
|
richard
|
|
response 166 of 457:
|
May 1 21:06 UTC 2005 |
STeve said:
"We need to alter the size of some of the partitions on Grex. In
particular the /var/mail partition is too small. We keep on running
into this because of the ever increasing amounts of spam we're getting"
Is this not further evidence that grex needs to get out of the offsite
email business? grex should continue to offer email within the grex
site for all users, but to send or receive email outside grex, you
should have to be a paying member. Grex doesn't have the resources
anymore, if it ever did, to be a free email provider for the universe,
and too many people have and will abuse Grex with its free anonymous
email addresses, or worse use it for unethical purposes.
In fact, I'd think it a good possibility that the FBI probably has Grex
on a list of websites that could potentially be used to traffic
terrorist information, because grex gives out free anonymous email
addresses with an automated program and no verification. Do you keep
making the partitions larger and larger, and keep taking risks that
vandals or terrorists might be coming here, or do you finally
say, "enough is enough, go use hotmail or yahoo for email!"
|
steve
|
|
response 167 of 457:
|
May 1 21:56 UTC 2005 |
The disks were bought around May 2003. Checking disks for problems
is a difficult thing. There is a system that IBM developed for its
own disks that has failed to catch more problems than it has found,
in my usage of it. Grex munches on disks. We might want to consider
replacing them every X years, but figuring out what X should be is
interesting.
I don't see needing to increase the size of /var as an
indication that we need to stop doing email. Disk is the one thing
that has dropped in cost over everything else. For about $250 we
could devote a 36G disk to mail and likely have enough disk for some
time. Also, with spam filtering our disk needs will slow down.
If Grex is on some government list, which I wouldn't be surprised
if true, it doesn't matter if we offer mail or not. The fact that
we're an open system is enough. This is still America and the secret
police aren't quite here yet. I'm not going to worry about it.
|
mcnally
|
|
response 168 of 457:
|
May 1 22:31 UTC 2005 |
re #166: I propose that Richard create a list of services that
he approves of or uses personally so that we can all know which
other services should be eliminated..
|
naftee
|
|
response 169 of 457:
|
May 2 01:09 UTC 2005 |
whoa, steVE! are your lungs ok ?!
|
aruba
|
|
response 170 of 457:
|
May 2 02:58 UTC 2005 |
I ordered the new disk from Leeron on Saturday. There was some confusion
about how dead the old disk is, and whether we should send it in for
warranty repair. The consensus was that we should send it in, but use the
replacement they send us back as a backup.
But, mostly, I've been dragging my heels because I've had other things to do.
So sorry for the delay. We should have a new disk within a week.
|
keesan
|
|
response 171 of 457:
|
May 2 02:59 UTC 2005 |
I have several friends who use grex ONLY for email and one of them was a paid
member (she may stop paying since she lives in Chelsea) but the others still
appreciate the email. I told them they should not feel obligated to pay for
light use. One of them also has an ISP with mail but got used to grex.
STeve, were these disks bought new in 2003 and only put into service a few
months ago? If so, would a warranty at least cover them going bad if we got
similar ones new now and they lasted under a year?
|
aruba
|
|
response 172 of 457:
|
May 2 03:02 UTC 2005 |
Re #171: Sindi - yes, the disks are warrantied by Seagate for 5 years. So
we should be able to get a replacement for the one that failed, as soon as
someone can pull it out of the machine and get it to me.
|
richard
|
|
response 173 of 457:
|
May 2 04:05 UTC 2005 |
,
|
steve
|
|
response 174 of 457:
|
May 2 04:13 UTC 2005 |
The warranty is almost irrelevant. Disks going down are a disaster for
any entity, and with Grex it's even worse because of access and staff time
issues. I try to optimize on disks that have a decent record of not dying
and use those.
Replacement disks obtained from a warranty exchange make me queasy. They
are almost universally refurbished disks, meaning they came into the
manufacturer because of some problem and got "fixed". I've never liked using
this kind of disk in an intense environment, and that's exactly what Grex is.
These days we can get a 36G SCSI disk for about $250, which is pretty amazing.
That's a 15,000 rpm ultra-320 speed disk, too. Amazing.
|