Grex > Helpers > #140: Grex System Problems - Spring 2005
25 new of 457 responses total.
nharmon
|
|
response 214 of 457:
|
May 3 17:42 UTC 2005 |
> I wouldn't rely on it to run a bank.
With most financial institutions, security is concentrated on the perimeter,
usually because the mainframe systems that run banking software use insecure
operating systems (Windows 2000 Datacenter comes to mind).
|
tod
|
|
response 215 of 457:
|
May 3 17:50 UTC 2005 |
re #214
> With most financial institutions, security is concentrated on the perimeter
Actually, security is concentrated "in depth," that is, at multiple layers, like
a fortress with a moat, gate, guard tower, huge wall, etc.
A firewall simply doesn't cut it anymore when you have GLBA worries, IT
productivity problems, password headaches, etc.
At a minimum you should have two firewalls of different flavors at the
perimeter of a financial institution, but this is not a DMZ or IPS discussion.
The fact is, Grex had a security flaw and it wasn't reported to the users.
I'm disheartened at how this and the subpoena discussions have been buried
from the public discussion.
|
marcvh
|
|
response 216 of 457:
|
May 3 17:58 UTC 2005 |
Sure, and also financial security is based on the concept of transactions
and auditability. Grex doesn't have such beasts.
|
cross
|
|
response 217 of 457:
|
May 3 18:30 UTC 2005 |
This response has been erased.
|
nharmon
|
|
response 218 of 457:
|
May 3 18:32 UTC 2005 |
Re #215 - In Grex's defense, perhaps the Coop conference is a more appropriate
place to discuss Grex policies regarding notifying users. I've posted an item
that will hopefully attract some comments on the pros/cons.
Re #216 - It used to be that banks didn't have to care very much about their
customers' names, addresses, etc., because this data was regularly bought
and sold to other companies. But the GLBA now requires us to safeguard this
information with the utmost diligence, to the extent that some banks will
fire employees for not locking their PCs and leaving them with customer
information still on the screen.
|
steve
|
|
response 219 of 457:
|
May 3 18:39 UTC 2005 |
We use a Broadcom 5702x nic.
Grex isn't a transaction system. I will agree that such a thing presents
more of a load than Grex does, but it also has hardware better suited to that
task. We've *listened* to the disks, Dan. Honestly. There were times on the
Sun-4/670 that you could just sit there and hear them madly running around.
Perhaps I didn't say it well enough but OpenBSD may be significantly different
from SunOS in this regard; maybe it will be kinder on the disks due to caching
issues. I guess we'll see.
|
tod
|
|
response 220 of 457:
|
May 3 18:49 UTC 2005 |
Maybe IDE would be kinder than SCSI?
|
naftee
|
|
response 221 of 457:
|
May 3 18:51 UTC 2005 |
I use FreeBSD and Realtek and am pleased by the performance of both.
|
steve
|
|
response 222 of 457:
|
May 3 18:56 UTC 2005 |
I don't think the disk interface matters much. However, it has occurred
to me in the last few minutes that we're swimming in RAM compared to what
we had under SunOS: 256M there, and 1.5G here. That will eliminate swapping
and use about 75M ram for file caching, which will also help.
I just changed the default limits in /etc/login.conf for maxproc to 32.
Maxproc-max was at 128.
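For anyone who wants to check what a login class actually sets, here is a minimal sketch in Python that parses a termcap-style login.conf record and pulls out the process limits. The sample record is illustrative and is not Grex's real file; it assumes the "maxproc to 32" change corresponds to the soft limit (`maxproc-cur`) and that 128 is the hard limit (`maxproc-max`). The parser is deliberately naive and would mis-split values that contain a colon.

```python
def parse_login_class(record: str) -> dict:
    """Parse one termcap-style login.conf record into its class name and capabilities."""
    # Join continuation lines (trailing backslash) into a single string.
    joined = "".join(line.strip().rstrip("\\") for line in record.splitlines())
    fields = [f.strip() for f in joined.split(":") if f.strip()]
    name, caps = fields[0], {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        caps[key] = value if sep else True  # boolean capabilities have no "="
    return {"class": name, "caps": caps}


# Hypothetical record for illustration only (not copied from Grex).
SAMPLE = r"""
default:\
	:maxproc-max=128:\
	:maxproc-cur=32:
"""

parsed = parse_login_class(SAMPLE)
soft = int(parsed["caps"]["maxproc-cur"])
hard = int(parsed["caps"]["maxproc-max"])
print(f"class={parsed['class']} soft={soft} hard={hard}")
```

With a soft limit of 32, a runaway fork loop in one account should hit the per-user ceiling long before it can fill the system process table, which is the point of the change.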
|
steve
|
|
response 223 of 457:
|
May 3 19:06 UTC 2005 |
Now, as for sd0 having a problem, I just mounted it and tried copying
spwd.db to /dev/null. It failed. The message in /var/log/messages is
May 3 15:00:59 grex /bsd: sd0(ahc1:0:0): Check Condition on opcode 0x28
May 3 15:00:59 grex /bsd: SENSE KEY: Media Error
May 3 15:00:59 grex /bsd: INFO FIELD: 116647
May 3 15:00:59 grex /bsd: ASC/ASCQ: Unrecovered Read Error
May 3 15:00:59 grex /bsd: FRU CODE: 0xe4
May 3 15:00:59 grex /bsd: SKSV: Actual Retry Count: 134
May 3 15:01:00 grex /bsd: sd0(ahc1:0:0): Check Condition on opcode 0x28
May 3 15:01:00 grex /bsd: SENSE KEY: Media Error
May 3 15:01:00 grex /bsd: INFO FIELD: 116647
May 3 15:01:00 grex /bsd: ASC/ASCQ: Unrecovered Read Error
May 3 15:01:00 grex /bsd: FRU CODE: 0xe4
May 3 15:01:00 grex /bsd: SKSV: Actual Retry Count: 134
There are other errors on the disk as well. When I tried to dd the
entire disk I brought the system down, the day Joe said that newuser
was failing.
Have to go back and do work work now...
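Since the complaint was that nobody had looked at /var/log/messages, here is a rough Python sketch for tallying these reports by device and block. It only understands the line shapes quoted above, and it treats the INFO FIELD value as a block address, which is an assumption. The function name and the sample input are made up for the example.

```python
import re
from collections import defaultdict

# Matches the first line of a report, e.g.
#   May 3 15:00:59 grex /bsd: sd0(ahc1:0:0): Check Condition on opcode 0x28
DEVICE_RE = re.compile(r"/bsd: (\w+)\([^)]*\): Check Condition")
FIELD_RE = re.compile(r"/bsd:\s+(SENSE KEY|INFO FIELD):\s*(.+)$")


def summarize_media_errors(lines):
    """Return {device: {info_field_value: count}} for Media Error reports."""
    summary = defaultdict(lambda: defaultdict(int))
    device, is_media_error = None, False
    for line in lines:
        line = line.rstrip()
        m = DEVICE_RE.search(line)
        if m:
            # A new report starts; remember which device it belongs to.
            device, is_media_error = m.group(1), False
            continue
        m = FIELD_RE.search(line)
        if not m or device is None:
            continue
        key, value = m.groups()
        if key == "SENSE KEY":
            is_media_error = value.strip() == "Media Error"
        elif key == "INFO FIELD" and is_media_error:
            summary[device][int(value)] += 1
    return {dev: dict(blocks) for dev, blocks in summary.items()}


SAMPLE = """\
May 3 15:00:59 grex /bsd: sd0(ahc1:0:0): Check Condition on opcode 0x28
May 3 15:00:59 grex /bsd: SENSE KEY: Media Error
May 3 15:00:59 grex /bsd: INFO FIELD: 116647
May 3 15:01:00 grex /bsd: sd0(ahc1:0:0): Check Condition on opcode 0x28
May 3 15:01:00 grex /bsd: SENSE KEY: Media Error
May 3 15:01:00 grex /bsd: INFO FIELD: 116647
""".splitlines()

report = summarize_media_errors(SAMPLE)
print(report)
```

The same value showing up again and again on a single device points at one bad spot on that disk, while errors spread across several devices would suggest something shared, such as the controller, cable, or driver.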
|
naftee
|
|
response 224 of 457:
|
May 3 19:12 UTC 2005 |
work work work
|
steve
|
|
response 225 of 457:
|
May 3 19:15 UTC 2005 |
work plod work
|
nharmon
|
|
response 226 of 457:
|
May 3 19:29 UTC 2005 |
plod no work,... abort, retry, or ignore?
|
cross
|
|
response 227 of 457:
|
May 3 20:02 UTC 2005 |
This response has been erased.
|
steve
|
|
response 228 of 457:
|
May 3 20:11 UTC 2005 |
Dan, you are sliding off into fantasy land here. THE DISK HAS PROBLEMS.
It is as simple as that. If no one else saw the errors it was because no
one looked at /var/log/messages. I will point out that you could have
rummaged around there yourself to find errors. Sigh, I don't know why
I'm bothering to respond to some of your comments, but I will say that I
think Marcus and I know the difference between the sound of bearings
and the noise a disk makes when the heads are constantly moving.
|
tod
|
|
response 229 of 457:
|
May 3 20:17 UTC 2005 |
We're not worthy.
|
gull
|
|
response 230 of 457:
|
May 3 20:28 UTC 2005 |
Re resp:211: In my (admittedly limited) experience, banks run on Windows
and proprietary mainframes. The bank I worked for had *no* Internet
connections at all, though. All branch-to-branch connections were on
leased lines.
Re resp:213: #3 is a minor issue. AFAIK there are no major bugfixes in
recent versions of Exim. While versions earlier than 4.50 do not come
with Exiscan out of the box, it's easy to patch in, and the OpenBSD port
probably already includes a flag you can toggle to include it.
FreeBSD's does.
Re resp:219: Are we swapping? Maybe we need more RAM. Even if we're
not swapping, more RAM means more disk cache. RAM is cheap.
Re resp:227: Those messages pretty clearly indicate a hardware problem,
and if they were the result of an incorrect request on the part of the
driver we'd be seeing them on the other disks, too. I think you're
really reaching to blame OpenBSD here, which is unfortunate, because it
makes this look like a matter of religion on your part instead of a
technical argument.
If you're really convinced that OpenBSD is somehow causing the illusion
of a hardware failure on this disk, I suggest connecting it to another
system running a different OS and trying to access it. That should
settle the issue.
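For the cross-check itself, something like the following Python sketch could be run against the disk on the other machine. It reads a file or device block by block, notes the offsets where the read fails, and keeps going, roughly what dd does with conv=noerror. The function name, block size, and error cap are arbitrary choices for illustration, and the demo at the bottom only exercises it on a scratch file, not on a real device.

```python
import os
import tempfile


def scan_for_read_errors(path, block_size=64 * 1024, max_errors=100):
    """Read `path` sequentially; return (bytes_read, offsets_that_failed)."""
    bad_offsets, total, offset = [], 0, 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Stop after max_errors so a failure unrelated to any one block
        # cannot make the loop run forever.
        while len(bad_offsets) < max_errors:
            try:
                chunk = os.pread(fd, block_size, offset)
            except OSError:
                # Unreadable block: record where it was and skip past it.
                bad_offsets.append(offset)
                offset += block_size
                continue
            if not chunk:
                break  # end of file or device
            total += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return total, bad_offsets


# Demo on a healthy scratch file; a real run would point at the raw device.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"\0" * 200_000)
    tmp.flush()
    demo_total, demo_bad = scan_for_read_errors(tmp.name)
print(demo_total, demo_bad)
```

If the same offsets fail under a different OS and controller, the disk is at fault. If the whole disk reads cleanly there, the driver theory deserves a second look.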
|
steve
|
|
response 231 of 457:
|
May 3 20:32 UTC 2005 |
No, we're doing fine for ram, now. The 256M swapping situation was
on the Sun-4/670. Unless spam processing eats up Grex's hardware I
think the 1.5G we have will last for some time. Sorry I wasn't clear
on that.
|
gull
|
|
response 232 of 457:
|
May 3 20:40 UTC 2005 |
I just took a look, and we've currently got 1.2 gigabytes free, so I
think you're right. I don't know how to find out how much OpenBSD is
using as disk cache. In FreeBSD it's reported by "top", but that
doesn't seem to be the case here.
|
gull
|
|
response 233 of 457:
|
May 3 20:43 UTC 2005 |
Also, someone on staff should please read my last set of comments in the
Exim item in the Garage conf. I clear up a couple of apparent
misunderstandings in Grex's current exim.conf file, and point out some
stuff that was copied verbatim from my example and shouldn't have been.
It looks like those things still haven't been fixed.
|
steve
|
|
response 234 of 457:
|
May 3 20:52 UTC 2005 |
Want to make up a diff?
|
cross
|
|
response 235 of 457:
|
May 3 22:00 UTC 2005 |
This response has been erased.
|
steve
|
|
response 236 of 457:
|
May 3 22:39 UTC 2005 |
1) The problem with permissions on the tty was our fault, I
believe. Do you think that this was a part of the release, and
that no one ever found it? We messed up, not the OS itself. I
ask you to prove otherwise. If it's a real bug in the distribution
others would have seen it.
2/3) At least some of the crashes have been due to our nic. That
code was worked on post 3.5 release. I won't say that we've not
crashed for other reasons but have we properly analyzed it? The
quota code could well have some problems. I'll bet we're pushing
it. You are right that in our current configuration we are more
disk intensive than if softupdates were on. I'm pretty sure that
the softupdate code was changed post 3.5 and in visiting the
changelog between 3.5 and 3.6, we find
"Big FFS softdep merge with FreeBSD, fixing a number of bugs."
In the changelog between 3.6 and 3.7 we find
"Fix a soft dependencies problem that caused processes to get stuck."
This was a part of some stuff from FreeBSD which wasn't complete
apparently, and was then fixed.
You know as well as I do that saying something "will be fixed
in ..." is a dangerous thing to say. I'm not going to let your
bias against OpenBSD make me say things that shouldn't be
promised. But yes, I *do* think that 3.7 is going to be a good
move for Grex, as will 3.8 and so on.
|
mcnally
|
|
response 237 of 457:
|
May 3 22:57 UTC 2005 |
> I really want to know why people take one hypothesis I propose
> (which I clearly stated was a hypothesis) and fixating on that,
My own theory is that it's due to the somewhat confrontational tone
of your messages. Even though I agree with most of your conclusions
I'll admit I'm somewhat put off by the way you've worded your responses.
Presumably it's because you feel strongly about the issue, which I
applaud.
But if Grex wants to know why volunteer staff resources are drying up,
we need look no further than the way this and other "discussions" about
the system develop. Very few people will volunteer to join a flame war
already in progress. Of course it takes more than one party to have a
good fight, but whoever's responsible for the tone, maybe we can all just
back off a little bit and instead of concentrating on what possibly
*should* have been done, figure out what to do now.
> Okay, so here are my major points:
>
> (1) We've had one *major* security hole in OpenBSD.
Agreed, and frankly it's a baffling one for a supposedly
secure OS. Granted the security promise offered by OpenBSD
partisans is usually "Only <n> root exploits since <time t>"
but world-readable ttys is bizarre.
> I don't think it was a configuration issue.
I'm not so sure about this -- it seems incredible to assume
that if this were really the system default it wouldn't be
very widely known, and OpenBSD rightly slammed for it.
I think when I get home tonight I'm going to have to install
OpenBSD on a spare computer and test to see whether this is,
in fact, the way things work out of the box on OpenBSD.
> (2) OpenBSD crashes quite a bit more than I or anyone else am
> comfortable with. It doesn't appear to be because of the network
> driver. It often crashes on filesystem errors or, apparently,
> because the proc table gets full.
Because of the time investment required to change OSes again
and the fact that we don't know for sure that FreeBSD will be
better, I'm inclined to give OpenBSD more time to prove itself
providing:
(a) we can guarantee a fix to the fork-bomb vulnerability, and
(b) we put another disk in place of the one that's erroring.
Of course I'm not sure I even get a vote, but that would be my
recommendation if my advice were solicited.
> (3) Our *application* is not disk intensive, but because things
> like soft metadata updates aren't reliable on OpenBSD, we're
> *making it* disk intensive. If that's chewing up drives, then
> fine, but it's not like grex has such a high volume of *usage* that
> it *has* to be that way. *REAL* high volume usage
>
> Steve, you were one of the people who were adamant about OpenBSD.
> Please respond to these problems. Should I believe that they'll
> be solved in the newest version of OpenBSD? Or did we make a mistake
> going with OpenBSD?
Aren't there disk-usage monitoring tools we can use to get some
sense of what's going on with our disks? And would it help
relieve thrashing if we made /tmp a 512MB RAM disk or picked
something like that?
Something's already seriously wrong if / (containing /etc) is
being written to so often that disk corruption is regularly
bringing the system down. How many things need to write to /
anyway? Newuser? What else? Virtually everything else on the
system seems like it should write to /var, /tmp, or a bbs or
homedir partition.
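As a first pass at the "what writes to /" question, a Python sketch along these lines would list files on one file system that changed recently, without descending into other partitions mounted beneath it. The function names and the one-hour window are assumptions made for the example, and the demo uses a throwaway directory with made-up file names rather than a real root partition.

```python
import os
import tempfile
import time


def _same_device(path, dev):
    """True if `path` lives on the file system identified by device id `dev`."""
    try:
        return os.lstat(path).st_dev == dev
    except OSError:
        return False


def recently_modified(root, within_seconds=3600, now=None):
    """Return [(mtime, path)] for files under `root` changed recently, newest first."""
    now = time.time() if now is None else now
    root_dev = os.lstat(root).st_dev
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories that are mount points for other file systems.
        dirnames[:] = [d for d in dirnames
                       if _same_device(os.path.join(dirpath, d), root_dev)]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable
            if st.st_dev == root_dev and now - st.st_mtime <= within_seconds:
                hits.append((st.st_mtime, path))
    return sorted(hits, reverse=True)


# Demo on a scratch directory: one fresh file, one made to look a day old.
with tempfile.TemporaryDirectory() as demo_root:
    fresh = os.path.join(demo_root, "passwd.new")
    stale = os.path.join(demo_root, "motd")
    for p in (fresh, stale):
        with open(p, "w") as fh:
            fh.write("x")
    day_ago = time.time() - 86400
    os.utime(stale, (day_ago, day_ago))
    demo_hits = [os.path.basename(p) for _, p in recently_modified(demo_root)]
print(demo_hits)
```

Modification times only show the last write to each file, not how often it is written, so this narrows down the suspects rather than measuring load. For actual transfer rates per disk, iostat is the usual tool.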
|
steve
|
|
response 238 of 457:
|
May 3 23:16 UTC 2005 |
sd0a went because of a general problem with it. I seriously doubt
that the runs of newuser caused this. Though /a was on sd0 and that
did have lots of i/o.
|